Temperature prediction in Moscow for 24 hours

Description

The goals of the project:

  1. Train a model that predicts the temperature in Moscow for the next 24 hours in 3-hour increments (8 predictions).
  2. Automate the temperature forecasting process based on new data.

Installation

PostgreSQL

Before starting the process locally, you need to create a database connection. Use the following settings (they are set as defaults in the project):
- Host - localhost
- Database - postgres
- Port - 5432
- Username - username
- Password - qwerty
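
To verify the connection, here is a minimal sketch using SQLAlchemy with the settings above (the client library is an assumption; any PostgreSQL client works, and the psycopg2 driver must be installed):

from sqlalchemy import create_engine, text

# Default connection settings from above: user "username", password "qwerty"
engine = create_engine("postgresql://username:qwerty@localhost:5432/postgres")

with engine.connect() as conn:
    # A trivial query to confirm that the connection works
    print(conn.execute(text("SELECT version()")).scalar())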

pgAdmin4:

(screenshots: pgadmin_1, pgadmin_2)

Make sure that you have a user username with the Superuser role:

(screenshot: pgadmin_3)

Dbeaver:

(screenshot: postgresConnectionExample)

Airflow (ETL + Predict)

To start the process locally, follow the steps below:

  1. Download the Airflow folder locally.

  2. Make sure that Docker is running and the Docker engine has sufficient memory allocated.

Before running Airflow, prepare the environment by executing the following steps:

  • If you are working on Linux, specify the AIRFLOW_UID by running the command:
echo -e "AIRFLOW_UID=$(id -u)" > .env
  • Perform the database migration and create the initial user account by running the command:
docker compose up airflow-init

The created user account will have the login airflow and the password airflow.

  3. Start Airflow and build the Docker containers:
docker compose up --build -d

Project diagram

(diagram: Project diagram)

Project Description

The project consists of the following parts:

- ETL – downloads historical temperature data for Moscow. The data is downloaded from https://rp5.ru/Weather_archive_in_Moscow and then loaded into PostgreSQL. If a new row (a new temperature reading) is added, the Predict DAG is triggered.

- Train model – the model training pipeline. This part is optional, because the model has already been trained.

- Predict – temperature prediction in Moscow for the next 24 hours. Triggered when a new temperature row is added in the ETL step.

Apache Airflow is used as the orchestrator.

ETL and Predict are loaded as DAGs in Apache Airflow.

Train model is located in a separate folder.

ETL

ETL process diagram

(diagram: ETL.drawio)

Description of steps

- Init browser – set the driver options and the save path for downloaded files, open the browser, and return the driver (see the sketch after this list).

- Download archive – go to https://rp5.ru/Weather_archive_in_Moscow, enter weather station 27612 and the date range, and download the archive. Returns the path to the archive.

- Unzip archive – unzip the archive and return the path to the Excel file.

- Preprocess data – read the Excel file and return a dataframe of historical data.

If there is no 'weather' table in the database (first run, or the table was deleted):
 - Create db table – create the 'weather' table and load the historical data into it.

- Update db table – load new data into the 'weather' table, if there is any.
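
The README does not include the scraping code itself; the sketch below illustrates the Init browser step, assuming Selenium with Chrome (the function name and download path are illustrative):

from selenium import webdriver

def init_browser(download_dir: str) -> webdriver.Chrome:
    # Run headless and save downloads into download_dir without prompting
    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    options.add_experimental_option(
        "prefs", {"download.default_directory": download_dir}
    )
    return webdriver.Chrome(options=options)

driver = init_browser("/tmp/weather_archive")
driver.get("https://rp5.ru/Weather_archive_in_Moscow")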

Train model

Train model process diagram

(diagram: Train model.drawio)

Description of steps

- Start – run the Docker container, or run locally in an IDE.

- Load raw data – get the data required for training the model from the database. By default, all rows available in the database are retrieved. To change the data range, use the 'date_from' and 'date_to' fields in CONFIG.py; they are useful for reproducible experiments.

- Preprocess data – fill NaN values, create features, create targets (24 hours ahead: 8 columns in 3-hour increments), and split the data into train/val/test sets (sketched after this list).

- Tune model – optional stage. By default, hyperparameters are already defined in CONFIG.py. If you want to tune the hyperparameters yourself, uncomment the part with tune_model. After tuning, the tuned hyperparameters will be used when training the model.

- Train model – train the model, calculate the MAE, save the model to the output folder, and log training information to MLflow.
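
As an illustration of the preprocessing and training steps, here is a minimal self-contained sketch (the synthetic data, feature names, and hyperparameters are assumptions; the real pipeline lives in the train model folder and CONFIG.py):

import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.multioutput import MultiOutputRegressor
from xgboost import XGBRegressor

# Synthetic 3-hourly temperature series, only to make the sketch runnable
idx = pd.date_range("2020-01-01", periods=2000, freq="3h")
df = pd.DataFrame({"temp": 10 * np.sin(np.arange(2000) / 8)}, index=idx)

# Simple lag features and 8 forward-shifted targets (illustrative names)
for lag in (1, 2, 8):
    df[f"lag_{lag}"] = df["temp"].shift(lag)
for step in range(1, 9):
    df[f"target_{3 * step}h"] = df["temp"].shift(-step)
df = df.dropna()

features = ["temp", "lag_1", "lag_2", "lag_8"]
targets = [f"target_{3 * s}h" for s in range(1, 9)]
split = int(len(df) * 0.8)  # time-ordered split, no shuffling
X_train, X_test = df[features][:split], df[features][split:]
y_train, y_test = df[targets][:split], df[targets][split:]

# One XGBRegressor per 3-hour horizon, wrapped for the 8 target columns
model = MultiOutputRegressor(XGBRegressor(n_estimators=200, max_depth=5))
model.fit(X_train, y_train)
print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))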

Predict

Predict process diagram

(diagram: Predict.drawio)

Description of steps

- Start – event-based. It is launched after the ETL DAG has completed and a new row of data has been added to the 'weather' table.

If there is no 'weather_predictions' table in the database (first run, or the table was deleted):
  - Create db table – create the 'weather_predictions' table.

- Load model – load the trained model.

- Load raw data – get the data from the 'weather' table that is required to predict the temperature for the next 24 hours.

- Preprocess data – fill NaN values, if any, and create the dataframe needed for prediction.

- Predict – predict the temperature for the next 24 hours.

- Postprocess data – convert the prediction results into a dataframe.

- Insert to db table – insert the predicted values into the 'weather_predictions' table (see the sketch below).
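
A minimal sketch of the final insert step, assuming pandas and SQLAlchemy (the column names mirror the prediction table described in the Result section and are assumptions):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://username:qwerty@localhost:5432/postgres")

# In the real DAG these 8 values come from the trained model
preds = [1.5, 1.7, 2.0, 2.4, 2.1, 1.6, 1.0, 0.4]
row = pd.DataFrame([preds], columns=[f"pred_temp_{3 * i}" for i in range(1, 9)])
row["datetime"] = pd.Timestamp.now()
row.to_sql("weather_predictions", engine, if_exists="append", index=False)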

Run Process

Airflow (ETL+Predict)

After the 'Installation' step is completed, follow the steps below:

  1. Access the Airflow web interface in your browser at http://localhost:8080.

  2. Log in with username airflow and password airflow.

(screenshot: Airflow login)

  3. Turn on the Weather_ETL DAG and wait until it finishes. It will create the 'weather' table in PostgreSQL with the historical weather data.

(screenshot: DAG Weather_ETL)

  4. Turn on the Weather_prediction DAG; it will be triggered by Weather_ETL. The DAG will create the 'weather_predictions' table, which holds the predictions for the next 24 hours.

(screenshot: DAG Weather_prediction)

The Weather_ETL DAG runs every 3 hours and checks whether new historical data has appeared. If new data appears, Weather_ETL triggers Weather_prediction to make new predictions.
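
A sketch of how the schedule and the trigger might be wired in Airflow (the DAG ids come from above; the schedule expression, start date, and task id are illustrative):

import pendulum
from airflow import DAG
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

with DAG(
    dag_id="Weather_ETL",
    schedule_interval="0 */3 * * *",  # check for new data every 3 hours
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    catchup=False,
) as dag:
    # ... the ETL tasks described above go here ...
    trigger_prediction = TriggerDagRunOperator(
        task_id="trigger_weather_prediction",
        trigger_dag_id="Weather_prediction",
    )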

When you are finished working and want to clean up your environment, run:

docker compose down --volumes --rmi all

Train model (optional)

If you want to train the model, follow the steps below:

  1. Download the train model folder locally.

  2. In the folder, create a virtual environment:

python3 -m venv env
  3. Activate the virtual environment:
 source env/bin/activate
  4. Install the required libraries from requirements.txt:
 pip install -r requirements.txt
  5. Run the MLflow UI:
 mlflow ui
  6. Access the MLflow web interface in your browser at http://127.0.0.1:5000.

  7. Uncomment 'params = tune_model(file_dirs, CONFIG)' (optional, for tuning the model).

  8. Run main.py.

After main.py finishes running, you will find the results in MLflow. The model will be saved in your-project-folder/output.

(screenshot: MLflow UI)
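
For reference, a minimal sketch of how a run's parameters and metrics can be logged to MLflow (the names and values here are illustrative; the actual hyperparameters live in CONFIG.py):

import mlflow

with mlflow.start_run(run_name="xgb_temperature"):
    # Hyperparameters used for this training run (illustrative values)
    mlflow.log_params({"n_estimators": 200, "max_depth": 5})
    # The evaluation metric tracked by the project
    mlflow.log_metric("mae", 1.008)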

Feel free to create new features in preprocess_data/feature_engineering, tune the model, or change the data range used (CONFIG['date_from'], CONFIG['date_to']).

Result

The following features were selected as the best for weather forecasting; their importance weights are presented below.

(chart: feature importance weights)

Chosen model: XGBRegressor
MAE: 1.008 (over 8200 test-set datetimes)

Example of predictions

The screenshot below shows a table where:
- datetime – the date and time from which the predictions were made
- temp_X – the real temperature X hours after datetime
- pred_temp_X – the predicted temperature X hours after datetime
- MAE_X – the absolute error of that prediction
- MAE – the mean error of all predictions made from datetime (8 predictions, one every 3 hours)

(screenshot: example predictions table)
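
How the error columns in that table can be computed, as a small sketch (the dummy dataframe only makes the snippet runnable; the real rows come from joining 'weather' and 'weather_predictions'):

import pandas as pd

# Dummy real/predicted pairs; the real table covers horizons 3, 6, ..., 24
df = pd.DataFrame({
    "temp_3": [1.0], "pred_temp_3": [1.4],
    "temp_6": [2.0], "pred_temp_6": [1.5],
})
horizons = [3, 6]
for x in horizons:
    # Absolute error of the prediction X hours ahead
    df[f"MAE_{x}"] = (df[f"temp_{x}"] - df[f"pred_temp_{x}"]).abs()
# Mean error across all horizons for this datetime
df["MAE"] = df[[f"MAE_{x}" for x in horizons]].mean(axis=1)
print(df)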