# Task 1: Pattern Recognition

For this task, you will use the following 4 CSV files:

1. **customers_consumptions.csv**  
2. **customers_metadata.csv**  
3. **weather_data.csv**  
4. **price_data.csv**

---

## Input Data

### **customers_consumptions.csv**  
This file contains consumption data of a subset of our customers. Key columns include:  
- **meteringpoint_id**: Unique customer identifier.  
- **validfrom**: Timestamp marking the start of the 15-minute measurement interval.  
- **quantity**: Electricity consumed (in kWh) during the 15-minute interval.  

The time period covered differs from customer to customer.

---

### **customers_metadata.csv**  
This file maps customers to their nearest weather station:  
- **meteringpoint_id**: Unique customer identifier.  
- **weatherstation_id**: Closest weather station to the customer.

---

### **weather_data.csv**  
Weather station measurements include:  
- **validfrom**: Timestamp of the weather measurement.  
- **air_temp**: Air temperature in degrees Celsius.  
- **ghi**: Global horizontal irradiation (sum of solar energy per unit area).  
- **cloud_opacity**: Cloud opacity percentage (0% = clear, 100% = opaque).  
- **precipitable_water**: Water vapor amount in $kg/m^2$.

---

### **price_data.csv**  
This file contains electricity market prices sampled hourly:  
- **timestamp**: Start of the 1-hour interval.  
- **price**: Electricity price for that hour (constant across Austria).
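
Because consumption is recorded every 15 minutes while prices are hourly, the two series need a common time grid before they can be compared. The sketch below shows one possible way to do this with pandas: sum the 15-minute quantities per hour and join the hourly price. The small dataframes are synthetic stand-ins for the real files; only the column names are taken from the descriptions above.

```python
import pandas as pd

# Synthetic stand-ins for customers_consumptions.csv and price_data.csv.
consumption = pd.DataFrame({
    "meteringpoint_id": [1] * 8,
    "validfrom": pd.date_range("2024-01-01 00:00", periods=8, freq="15min"),
    "quantity": [0.1, 0.2, 0.1, 0.2, 0.3, 0.3, 0.2, 0.2],
})
prices = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01 00:00", periods=2, freq="h"),
    "price": [50.0, 80.0],
})

# Floor each 15-minute timestamp to the start of its hour,
# then sum the kWh consumed by each customer within that hour.
consumption["hour"] = consumption["validfrom"].dt.floor("h")
hourly = consumption.groupby(
    ["meteringpoint_id", "hour"], as_index=False
)["quantity"].sum()

# Attach the hourly price; a left join keeps hours for which no price exists.
merged = hourly.merge(prices, left_on="hour", right_on="timestamp", how="left")
print(merged[["meteringpoint_id", "hour", "quantity", "price"]])
```

Aggregating consumption up to hourly resolution is only one option; repeating each hourly price across its four 15-minute intervals would keep the finer resolution instead.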

---

## Problem Description

Your goal is to **cluster customers into two groups** based on their **consumption behavior**. While we have an idea of a useful grouping, you are encouraged to experiment and propose novel approaches. Consider the following:  
- Patterns in consumption over time (e.g., daily or seasonal trends).  
- Relationships between consumption and external factors, such as weather conditions and electricity prices.

For example, many of our customers have electric vehicles, which they charge when prices are low. Others have home devices that track electricity prices and switch on when the price is low. Customers of these types should exhibit strongly price-related behaviour.

We welcome methods ranging from **simple correlation analysis** to **advanced machine learning techniques**.
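
As a starting point at the simple end of that range, the sketch below computes one feature per customer, the Pearson correlation between hourly consumption and price, and splits customers into two groups with k-means. Everything here is an illustrative assumption rather than the intended solution: the data is synthetic, three of the six hypothetical customers are constructed to consume more when prices are low, and a single correlation feature is the minimum that could work.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
hours = pd.date_range("2024-01-01", periods=24 * 14, freq="h")
# Synthetic hourly price with a daily cycle plus noise.
price = 60 + 30 * np.sin(np.arange(len(hours)) * 2 * np.pi / 24)
price = price + rng.normal(0, 2, len(hours))

rows = []
for cid in range(6):
    base = rng.uniform(0.3, 0.6)
    noise = rng.normal(0, 0.05, len(hours))
    if cid < 3:
        # Hypothetical price-sensitive customer: higher load at low prices.
        load = base + 0.01 * (price.max() - price) + noise
    else:
        # Hypothetical price-insensitive customer: load unrelated to price.
        load = base + noise
    rows.append(pd.DataFrame({
        "meteringpoint_id": cid, "hour": hours,
        "quantity": load, "price": price,
    }))
data = pd.concat(rows, ignore_index=True)

# One feature per customer: correlation of consumption with price.
price_corr = {
    cid: group["quantity"].corr(group["price"])
    for cid, group in data.groupby("meteringpoint_id")
}
features = pd.DataFrame({"price_corr": pd.Series(price_corr)})
features.index.name = "meteringpoint_id"

# Two clusters; the label values (0 or 1) are arbitrary.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
features["cluster"] = kmeans.fit_predict(features[["price_corr"]])
print(features)
```

On real data, further features such as daily load shape or temperature sensitivity would likely be needed, and they should be scaled before clustering.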

---

## Deliverables

1. **Output**  
   - Provide a file with customer groupings in the following format:  

     | meteringpoint_id | cluster |  
     |------------------|---------|  
     | 12345           | 0       |  
     | 67890           | 1       |  

     Here, the **cluster** column is binary (0 or 1).

2. **Methodology Description**  
   - Describe your thought process, clustering approach, and assumptions. This can be in the form of:  
     - Code comments.  
     - A short report or markdown file.
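
The output table described above can be produced directly from a dataframe. The following is a minimal sketch with made-up cluster assignments (the two IDs from the example table); it writes to an in-memory buffer so that it runs without touching the file system.

```python
import io

import pandas as pd

# Hypothetical cluster assignments, keyed by meteringpoint_id.
assignments = {12345: 0, 67890: 1}

output = pd.DataFrame({
    "meteringpoint_id": list(assignments.keys()),
    "cluster": list(assignments.values()),
})

# The cluster column must be binary.
assert set(output["cluster"]) <= {0, 1}

# Write CSV text to a buffer; pass a file path instead to save to disk.
buffer = io.StringIO()
output.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```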

---

## Additional Notes
 
- You are free to use libraries such as **pandas**, **scikit-learn**, or any other tools suitable for analysis.  
- Creativity and interpretability are key; feel free to justify and visualize your results as you see fit.  


In [None]:
# Write your code here

# Task 2: Forecasting

For this task, all of the data is provided within the **average_consumption.csv** file.

---

## Input Data

The **average_consumption.csv** file contains the following columns:  
- **validfrom**: Start of the 15-minute measurement interval (the values apply to the following 15 minutes).  
- **avg**: The target/dependent variable – average electricity consumption of a subset of customers (in kWh).  
- **Other columns**: Various explanatory variables that can be used for forecasting, including:  
   - Weather-related measurements (e.g., temperature, global horizontal irradiation, cloud opacity).  
   - Electricity prices.  
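
Since **validfrom** is a timestamp, calendar features can be derived from it in addition to the provided explanatory variables. The sketch below shows one possible set; the helper name `add_calendar_features` and the chosen features are illustrative assumptions, and the dataframe is synthetic.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in with three days of 15-minute timestamps.
df = pd.DataFrame({
    "validfrom": pd.date_range("2024-01-01", periods=96 * 3, freq="15min"),
})


def add_calendar_features(frame):
    """Return a copy of frame with calendar features derived from validfrom."""
    out = frame.copy()
    ts = pd.to_datetime(out["validfrom"])
    out["hour"] = ts.dt.hour
    out["dayofweek"] = ts.dt.dayofweek  # Monday = 0, Sunday = 6
    out["is_weekend"] = (ts.dt.dayofweek >= 5).astype(int)
    # Quarter-hour slot of the day (0..95), encoded on a circle so that
    # 23:45 and 00:00 end up close to each other.
    slot = ts.dt.hour * 4 + ts.dt.minute // 15
    out["slot_sin"] = np.sin(2 * np.pi * slot / 96)
    out["slot_cos"] = np.cos(2 * np.pi * slot / 96)
    return out


features = add_calendar_features(df)
print(features.head())
```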

---

## Problem Description

Your goal is to develop a **working forecasting algorithm** to predict the **avg** column (average consumption). While we will test your solution on out-of-sample data, the focus is not on achieving perfect accuracy but rather on how you approach and handle the task.  

You are encouraged to:  
- Preprocess the data appropriately (e.g. scale features).  
- Explore and engineer new features that may improve forecasting performance.  
- Implement proper **cross-validation techniques** to validate your model.  
- Use any forecasting techniques or algorithms you deem appropriate.

The metric we use in the forecasting pipeline is called **Area Percentage Error (APE)**, defined as

$$ APE(y, \hat{y}) = \frac{\sum_i |y_i - \hat{y}_i|}{\sum_i |y_i|}, $$

where $y$ are the true values and $\hat{y}$ are the forecasted values. Please use this metric to evaluate your model.
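
The formula translates into a few lines of NumPy. This is a minimal sketch; the function name `area_percentage_error` and the explicit check for an all-zero denominator are our own choices, not part of the definition.

```python
import numpy as np


def area_percentage_error(y_true, y_pred):
    """Sum of absolute errors divided by the sum of absolute true values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denominator = np.abs(y_true).sum()
    if denominator == 0:
        raise ValueError("APE is undefined when all true values are zero")
    return np.abs(y_true - y_pred).sum() / denominator


# Absolute errors are 0, 0 and 1; the true values sum to 6.
score = area_percentage_error([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
print(score)
```

For this small example the result is $1/6 \approx 0.167$.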

---

## Deliverables

1. **Code**  
   - A working implementation of your forecasting algorithm.  
   - Steps for data preprocessing, feature engineering, and model training.  
   - Include clear comments to explain your methodology and logic.  

2. **Model Output**  
   - Your code should produce predictions for the **avg** column on a test dataset (or unseen data).
   - We will test with input data in the same format as the provided file, so consider packaging the trained
   model into a function that takes a dataframe and returns the predictions.

3. **Methodology Description**  
   - Describe your thought process and approach to the problem.  
   - Explain the techniques used for data preprocessing, feature engineering, model selection, and validation.  
   - If possible, include a brief evaluation of your model (e.g., errors or performance metrics like RMSE, MAE).  
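
The deliverables above can be tied together roughly as follows: validate on time-ordered splits, score each split with APE, then wrap the final model in a function that accepts a dataframe. This is only a sketch under stated assumptions: the data is synthetic, `air_temp` is a placeholder for whichever explanatory columns the file actually contains, `make_features` and `predict` are hypothetical helper names, and the gradient-boosting model is one arbitrary choice among many.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import TimeSeriesSplit


def ape(y_true, y_pred):
    """Area Percentage Error as defined in the task description."""
    return np.abs(y_true - y_pred).sum() / np.abs(y_true).sum()


# Synthetic stand-in for average_consumption.csv (two weeks of data).
rng = np.random.default_rng(0)
n = 96 * 14
phase = np.arange(n) * 2 * np.pi / 96
df = pd.DataFrame({
    "validfrom": pd.date_range("2024-01-01", periods=n, freq="15min"),
})
df["air_temp"] = 5 + 5 * np.sin(phase) + rng.normal(0, 0.5, n)
df["avg"] = (0.5 + 0.2 * np.cos(phase) - 0.01 * df["air_temp"]
             + rng.normal(0, 0.02, n))


def make_features(frame):
    """Build the model inputs from a dataframe in the input format."""
    ts = pd.to_datetime(frame["validfrom"])
    return pd.DataFrame({
        "slot": ts.dt.hour * 4 + ts.dt.minute // 15,
        "dayofweek": ts.dt.dayofweek,
        "air_temp": frame["air_temp"],
    }, index=frame.index)


X, y = make_features(df), df["avg"].to_numpy()

# Each split trains on the past and tests on the block that follows,
# so no future information leaks into training.
scores = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=4).split(X):
    model = GradientBoostingRegressor(random_state=0)
    model.fit(X.iloc[train_idx], y[train_idx])
    scores.append(ape(y[test_idx], model.predict(X.iloc[test_idx])))
print("APE per fold:", np.round(scores, 3))

# Refit on all available data and expose a single prediction function.
final_model = GradientBoostingRegressor(random_state=0).fit(X, y)


def predict(frame):
    """Take a dataframe in the input format and return predictions for avg."""
    return final_model.predict(make_features(frame))
```

A time-ordered split is used instead of a shuffled k-fold because shuffling would let the model train on observations that come after the ones it is tested on.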

---

## Additional Notes

- Focus on demonstrating your understanding of **forecasting techniques** rather than perfect accuracy.  
- Use appropriate libraries such as **pandas**, **scikit-learn**, **statsmodels**, or **machine learning frameworks** like TensorFlow or PyTorch.  
- Creativity, thoroughness, and clarity in your approach are key.  


In [None]:
# Write your code here