# ___

# [ Machine Learning in Geosciences ]

**Department of Applied Geoinformatics and Cartography, Charles University**

*Lukas Brodsky lukas.brodsky@natur.cuni.cz*

    
___



# Exercise: End-to-End Machine Learning Project!

Develop the following steps for the given data set and described problem. 

1. Task description 
2. Exploratory Data Analysis
3. Data preparation 
4. Select and train model 
5. Model fine-tuning 
6. Results interpretation

___    

# 1. Task description - the Problem 

**Problem Statement** 
The goal is to develop a predictive model for housing prices in California using census data. The model should accurately estimate the median housing price for any given district based on relevant socioeconomic and geographical factors.

**Objective:**
The primary objective is to build a machine learning model that can generalize well across different regions in California. The model should leverage available census data to capture key patterns influencing housing prices, such as location, population density, income levels, and housing characteristics.

**Assumptions of the problem:** 
* There exists some (most likely non-linear) relationship between the input features (X) and the output target variable (y);
* The output target is a continuous variable, hence we employ a regression model;
* There are multiple features, hence multiple regression;
* There is no continuous flow of data, hence batch learning.


**The expected result:** 
The developed model shall predict housing prices based on a set of characteristics (features) with an error below 20 %.


**Data Description**
The dataset consists of housing-related and demographic attributes for various districts in California, as recorded in the U.S. census. Key features include:

- Median income of district residents
- Household population and housing unit counts
- Geographical coordinates (latitude, longitude)
- Proximity to the ocean (e.g., inland vs. coastal regions)
- Median house age and total rooms/bedrooms per district
- Median house value (target variable)

**Challenges:**
- Feature Engineering: Handling missing values, categorical encoding (e.g., proximity to ocean), and outlier detection.
- Non-linearity & Feature Interactions: The relationship between predictors and housing prices may be complex and require non-linear modeling approaches.
- Geospatial Variability: Housing prices vary significantly based on location, requiring spatial awareness in the model.
- Data Imbalance: Certain high-value or low-value districts may be underrepresented, leading to biased predictions.
- Generalization: Ensuring the model performs well on new, unseen districts rather than overfitting to training data.

**Strengths:**
- Rich Dataset: The dataset contains diverse attributes that provide valuable insights into housing price drivers.
- Geospatial Information: Latitude and longitude enable the inclusion of spatial dependencies.
- Potential for Feature Engineering: Additional features like population density or room-to-household ratios can improve predictive performance.

**Performance Measures:** 
- **Mean Absolute Error (MAE)**: Measures the average absolute difference between actual and predicted prices, providing an intuitive interpretation.
- Root Mean Squared Error (RMSE): Penalizes larger errors more than MAE, useful for emphasizing significant prediction mistakes.
- R² Score (Coefficient of Determination): Evaluates how well the model explains the variance in housing prices.
- Mean Absolute Percentage Error (MAPE): Expresses prediction error as a percentage, making it useful for relative performance comparisons.
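
The four measures above can be sketched in a few lines of NumPy. This is a minimal illustration, not a prescribed implementation: the helper name `regression_metrics` and the numbers are made up for the example. If the "error below 20 %" target is read as MAPE (an assumption, since the text does not name the metric), the `MAPE` entry is the value to compare against 20.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, R² and MAPE (in percent) for two 1-D sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    mae = np.mean(np.abs(residuals))
    rmse = np.sqrt(np.mean(residuals ** 2))
    # R² = 1 - SS_res / SS_tot
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    # MAPE is undefined if a true value is zero; house values are positive.
    mape = 100.0 * np.mean(np.abs(residuals / y_true))
    return {"MAE": mae, "RMSE": rmse, "R2": r2, "MAPE": mape}

# Made-up values, only to illustrate the calculation
y_true = [100_000, 200_000, 300_000, 400_000]
y_pred = [110_000, 190_000, 330_000, 360_000]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

MAE and RMSE are in the units of the target (dollars), while R² and MAPE are unitless, so the two pairs answer different questions about the same predictions.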

In [19]:
# Common imports
import numpy as np
import pandas as pd
import os
 
# TODO: add more 

np.random.seed(42)

# 2. Exploratory Data Analysis (EDA) 


- Load the data 
- Take a quick look at the data structure!
- Visualize the data to gain insights 
- Look for correlations
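
A possible starting point for these steps is sketched below. To stay runnable without the data file it builds a small synthetic table; with the real data one would call `pd.read_csv(data_path)` instead. The column names (`median_income`, `housing_median_age`, `ocean_proximity`, `median_house_value`) are assumptions about the CSV header and should be checked against the actual file.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing table (column names are assumptions).
rng = np.random.default_rng(42)
n = 200
median_income = rng.uniform(1.0, 10.0, size=n)
housing = pd.DataFrame({
    "median_income": median_income,
    "housing_median_age": rng.integers(1, 52, size=n),
    "ocean_proximity": rng.choice(["INLAND", "NEAR OCEAN"], size=n),
    "median_house_value": 40_000 * median_income + rng.normal(0, 20_000, size=n),
})

# Quick look at the structure: dtypes, non-null counts, summary statistics
housing.info()
summary = housing.describe()

# Category frequencies for the non-numeric column
category_counts = housing["ocean_proximity"].value_counts()

# Missing values per column
missing = housing.isna().sum()

# Linear correlation of each numeric column with the target
corr_with_target = (
    housing.select_dtypes("number")
    .corr()["median_house_value"]
    .sort_values(ascending=False)
)
print(corr_with_target)
```

For the visual part, `housing.hist(bins=50)` and a scatter plot of longitude against latitude coloured by house value are common choices; both need matplotlib and are left out here to keep the sketch dependency-light.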

In [None]:
# Data
data_path = './data/housing.csv'

In [None]:
pass

# 3. Prepare the data for Machine Learning algorithms

- New features (rooms_per_household, bedrooms_per_room, population_per_household)
- Fill no-data values or drop incomplete records
- Harmonize numerical data - Feature Encoding
- Split Data Set into Training and Test Sets (Try Stratified sampling - Is there a balance?)
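
One way to chain these steps with pandas alone is sketched here on synthetic data. The raw column names, the income bin edges, and the helper `add_ratio_features` are assumptions for illustration. The split is done before imputation so that the fill value comes from the training set only; feature scaling is not shown.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for housing.csv; all column names are assumptions.
rng = np.random.default_rng(42)
n = 1000
housing = pd.DataFrame({
    "total_rooms": rng.integers(500, 5000, size=n).astype(float),
    "total_bedrooms": rng.integers(100, 1000, size=n).astype(float),
    "population": rng.integers(300, 3000, size=n).astype(float),
    "households": rng.integers(100, 1000, size=n).astype(float),
    "median_income": rng.lognormal(mean=1.2, sigma=0.5, size=n),
    "ocean_proximity": rng.choice(["INLAND", "NEAR OCEAN", "NEAR BAY"], size=n),
})
# Blank out 20 values to mimic no-data records
housing.loc[housing.sample(20, random_state=0).index, "total_bedrooms"] = np.nan

# 1. One-hot encode the categorical column
housing = pd.get_dummies(housing, columns=["ocean_proximity"])

# 2. Stratified 80/20 split: bin income, then sample 20 % from every bin
income_cat = pd.cut(
    housing["median_income"],
    bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],  # illustrative bin edges
    labels=False,
)
test_set = housing.groupby(income_cat).sample(frac=0.2, random_state=42)
train_set = housing.drop(test_set.index)

# 3. Fill missing values with the training-set median
bedrooms_median = train_set["total_bedrooms"].median()
train_set = train_set.fillna({"total_bedrooms": bedrooms_median})
test_set = test_set.fillna({"total_bedrooms": bedrooms_median})

# 4. Derived ratio features, applied identically to both sets
def add_ratio_features(df):
    return df.assign(
        rooms_per_household=df["total_rooms"] / df["households"],
        bedrooms_per_room=df["total_bedrooms"] / df["total_rooms"],
        population_per_household=df["population"] / df["households"],
    )

train_set = add_ratio_features(train_set)
test_set = add_ratio_features(test_set)

# Balance check: share of each income bin overall vs. in the test set
overall_share = income_cat.value_counts(normalize=True).sort_index()
test_share = income_cat.loc[test_set.index].value_counts(normalize=True).sort_index()
print(pd.DataFrame({"overall": overall_share, "test": test_share}))
```

Comparing the two columns of the final table answers the "is there a balance?" question: with stratified sampling the shares should agree closely, whereas a purely random split can drift for the smaller bins.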


In [None]:
pass 

# 4. Select and train a model 

- Try linear model
- Apply non-linear model 
- Evaluate Overfitting
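
The comparison could look like the following sketch, which assumes scikit-learn is available and uses a synthetic non-linear target in place of the prepared housing features. `RandomForestRegressor` is just one possible non-linear model, chosen here for illustration. Overfitting shows up as a gap between training and validation error.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic non-linear data standing in for the prepared features and target
rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(600, 2))
y = 3 * np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.3, size=600)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42
)

def rmse(model, X, y):
    """Root mean squared error of a fitted model on (X, y)."""
    return float(np.sqrt(mean_squared_error(y, model.predict(X))))

models = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = {
        "train": rmse(model, X_train, y_train),
        "val": rmse(model, X_val, y_val),
    }
    print(name, results[name])
```

A model whose training and validation errors are both high is underfitting (typical for the linear model on this data), while a low training error combined with a clearly higher validation error points to overfitting.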


In [None]:
pass 

# 5. Model Fine-Tuning

- Fix Underfitting / Overfitting 
- Grid Search vs. Randomized Search
- Apply Stratified K-Fold Cross-Validation
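
A hedged sketch of both search strategies, again assuming scikit-learn and synthetic data. `StratifiedKFold` expects class labels, so for a regression target one workaround (an assumption of this sketch, not a built-in feature) is to bin the target into quantiles and pass the resulting index splits to the search objects. The parameter grid is illustrative only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, StratifiedKFold

# Synthetic data standing in for the prepared training set
rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(300, 2))
y = 3 * np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.3, size=300)

# Bin the continuous target into quintiles so the folds can be stratified
bin_edges = np.quantile(y, [0.2, 0.4, 0.6, 0.8])
y_binned = np.digitize(y, bin_edges)
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
cv_splits = list(skf.split(X, y_binned))  # list of (train_idx, test_idx)

param_grid = {"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5, 20]}

# Grid search: evaluates all 3 x 3 = 9 combinations
grid = GridSearchCV(
    RandomForestRegressor(n_estimators=20, random_state=42),
    param_grid,
    cv=cv_splits,
    scoring="neg_root_mean_squared_error",
)
grid.fit(X, y)

# Randomized search: evaluates only n_iter sampled combinations
random_search = RandomizedSearchCV(
    RandomForestRegressor(n_estimators=20, random_state=42),
    param_distributions=param_grid,
    n_iter=4,
    cv=cv_splits,
    scoring="neg_root_mean_squared_error",
    random_state=42,
)
random_search.fit(X, y)

print("grid:  ", grid.best_params_, -grid.best_score_)
print("random:", random_search.best_params_, -random_search.best_score_)
```

Grid search is exhaustive and its cost grows multiplicatively with every added parameter; randomized search trades completeness for a fixed budget, which tends to matter once the search space has more than a handful of combinations.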


In [None]:
pass

# 6. Results Interpretation

- Model Performance Insights & Interpretation
- Feature Importance & Model Explainability
- Spatial & Temporal Analysis of Predictions - plot predictions 
- Compare with Baseline Model
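
Two of these points, the baseline comparison and feature importance, can be sketched as follows, assuming scikit-learn. The data are synthetic, the feature names are hypothetical, and a median-predicting `DummyRegressor` is used as one possible baseline. Permutation importance is shown because it works with any fitted model, not because it is the only option.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data: two informative features and one pure-noise feature
rng = np.random.default_rng(42)
feature_names = ["informative_a", "informative_b", "noise"]  # hypothetical names
X = rng.uniform(-3, 3, size=(600, 3))
y = 3 * X[:, 0] + X[:, 1] ** 2 + rng.normal(0, 0.3, size=600)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Baseline: always predict the training-set median
baseline = DummyRegressor(strategy="median").fit(X_train, y_train)
model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_train, y_train)

mae_baseline = mean_absolute_error(y_test, baseline.predict(X_test))
mae_model = mean_absolute_error(y_test, model.predict(X_test))
print(f"MAE baseline: {mae_baseline:.3f}, MAE model: {mae_model:.3f}")

# Permutation importance: score drop on the test set when one feature is shuffled
perm = permutation_importance(model, X_test, y_test, n_repeats=5, random_state=42)
importance = dict(zip(feature_names, perm.importances_mean))
for name, value in sorted(importance.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {value:.3f}")
```

A model is only worth interpreting if it clearly beats the baseline; the importance ranking then indicates which features the predictions actually depend on. Plotting the test-set residuals on a longitude/latitude scatter would cover the spatial part and needs matplotlib.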


In [None]:
pass 