# Transform Raw Data into Features

This notebook, "Transform raw data into features," builds a single transformation pipeline from code refactored out of the previous notebooks. The goal is to make the data preparation steps more efficient and easier to follow.

The pipeline processes the raw data step by step so that it is clean, structured, and ready for analysis. The transformation steps include handling missing values, feature engineering, and scaling, which together convert raw sensor data into features for machine learning models. Running the same steps every time keeps our analysis and modeling consistent and reproducible.

In [1]:
%reload_ext autoreload
%autoreload 2

In [2]:
import sys
import os
from dotenv import load_dotenv

In [3]:
# Add the src directory to the Python path so the project modules can be imported
src_path = os.path.abspath(os.path.join('..', 'src'))
if src_path not in sys.path:
    sys.path.append(src_path)

# Expose the same path to subprocesses started from this notebook
os.environ['PYTHONPATH'] = src_path

# Load environment variables from .env file
load_dotenv()

### Pollutant Data

In [10]:
POLLUTANT_URL = "https://datenhub.ulm.de/ckan/api/3/action/datastore_search?resource_id=b49de35e-040c-4530-9208-eefadc97b610"

In [11]:
from pollutant_data_processing import fetch_data
pollutant_df_raw = fetch_data(POLLUTANT_URL)
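
The body of `fetch_data` is not shown in this notebook. As a rough sketch of what such a helper might do, the code below pages through a CKAN `datastore_search` endpoint, whose JSON response keeps the rows under `result.records`. The function names, the page size, and the stop condition are assumptions made for illustration, not the project's actual implementation.

```python
import json
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit
from urllib.request import urlopen


def with_paging(url: str, limit: int, offset: int) -> str:
    """Return `url` with CKAN's `limit` and `offset` query parameters set."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query))
    query.update({"limit": str(limit), "offset": str(offset)})
    return urlunsplit(parts._replace(query=urlencode(query)))


def extract_records(payload: dict) -> list:
    """Pull the row dictionaries out of a datastore_search response."""
    if not payload.get("success"):
        raise ValueError("CKAN reported an unsuccessful request")
    return payload["result"]["records"]


def fetch_all_records(url: str, page_size: int = 1000) -> list:
    """Hypothetical helper: request pages until a short page marks the end."""
    records, offset = [], 0
    while True:
        with urlopen(with_paging(url, page_size, offset)) as response:
            page = extract_records(json.load(response))
        records.extend(page)
        if len(page) < page_size:
            return records
        offset += page_size
```

The returned list of dictionaries can be passed to `pd.DataFrame(...)` to obtain a raw frame like `pollutant_df_raw`.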

In [23]:
from pollutant_data_processing import validate_data
pollutant_df_validated = validate_data(pollutant_df_raw)
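
What `validate_data` checks is not shown here. A minimal sketch of a schema check, assuming the raw frame is in long format with the hypothetical columns `timestamp`, `pollutant`, and `value`, could look like this:

```python
import pandas as pd

# Hypothetical long-format schema; the real column names may differ.
REQUIRED_COLUMNS = ("timestamp", "pollutant", "value")


def validate_long_format(df: pd.DataFrame) -> pd.DataFrame:
    """Raise early if the raw frame cannot feed the later pipeline steps."""
    missing = [column for column in REQUIRED_COLUMNS if column not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    if df.empty:
        raise ValueError("no rows to process")
    # Return the frame unchanged so the check can sit inside a chain of steps.
    return df
```

Failing fast at this point keeps a schema change in the upstream API from surfacing later as a confusing error inside the resampling or merging step.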

In [37]:
from pollutant_data_processing import clean_data
pollutant_df_cleaned = clean_data(pollutant_df_validated)
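
The cleaning rules inside `clean_data` are likewise not shown. One plausible sketch, again assuming the hypothetical long-format columns `timestamp`, `pollutant`, and `value`, coerces types, drops rows that cannot be parsed, and removes negative or duplicated readings. Dropping rows is only one way to handle missing values; interpolation would be another.

```python
import pandas as pd


def clean_long_format(df: pd.DataFrame) -> pd.DataFrame:
    """Coerce types and drop unusable rows (illustrative rules only)."""
    out = df.copy()
    # Unparseable timestamps and non-numeric readings become NaT / NaN.
    out["timestamp"] = pd.to_datetime(out["timestamp"], errors="coerce")
    out["value"] = pd.to_numeric(out["value"], errors="coerce")
    out = out.dropna(subset=["timestamp", "value"])
    # A concentration cannot be negative, so treat such readings as faulty.
    out = out[out["value"] >= 0]
    # Keep a single reading per pollutant and timestamp.
    out = out.drop_duplicates(subset=["timestamp", "pollutant"])
    return out.sort_values("timestamp").reset_index(drop=True)
```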

In [38]:
# Resample from hourly to daily frequency
from pollutant_data_processing import resample_data
pollutant_df_resampled = resample_data(pollutant_df_cleaned)

In [39]:
pollutant_df_resampled

| date | no2 | o3 | pm10 | pm2.5 |
| --- | --- | --- | --- | --- |
| 2020-11-13 | 21.375000 | 13.958333 | 21.083333 | 12.791667 |
| 2020-11-14 | 25.083333 | 17.791667 | 23.666667 | 13.958333 |
| 2020-11-15 | 22.958333 | 15.416667 | 22.291667 | 13.791667 |
| 2020-11-16 | 16.583333 | 36.125000 | 7.875000 | 3.000000 |
| 2020-11-17 | 23.125000 | 26.333333 | 13.458333 | 6.041667 |
| ... | ... | ... | ... | ... |
| 2024-04-12 | 22.542333 | 58.678333 | 18.327500 | 11.380000 |
| 2024-04-13 | 21.202750 | 60.049167 | 18.704167 | 10.898333 |
| 2024-04-14 | 16.921667 | 59.979167 | 17.485833 | 7.860833 |
| 2024-04-15 | 16.585167 | 54.872500 | 11.362500 | 5.818333 |
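
The output above has one row per day and one column per pollutant, which suggests a pivot followed by a daily mean. The sketch below reproduces that shape with pandas; the long-format input columns `timestamp`, `pollutant`, and `value` are assumptions, and `resample_data` itself may work differently.

```python
import pandas as pd


def resample_to_daily(df: pd.DataFrame) -> pd.DataFrame:
    """Pivot long-format hourly readings to wide format and average per day."""
    stamped = df.assign(timestamp=pd.to_datetime(df["timestamp"]))
    # One column per pollutant, indexed by the hourly timestamp.
    wide = stamped.pivot_table(index="timestamp", columns="pollutant", values="value")
    # "D" groups the hourly rows into calendar days.
    daily = wide.resample("D").mean()
    daily.index.name = "date"
    return daily
```

Days without any readings show up as rows of `NaN` in this version, so a real pipeline would still have to decide whether to drop or fill them.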


### Weather Data

In [4]:
from weather_data_processing import fetch_weather_data
weather_df_raw = fetch_weather_data()

In [6]:
from weather_data_processing import clean_weather_data
weather_df_cleaned = clean_weather_data(weather_df_raw)

In [8]:
from weather_data_processing import validate_weather_data
weather_df_validated = validate_weather_data(weather_df_cleaned)

In [9]:
weather_df_validated

|  | date | min_temp | max_temp | mean_temp | precipitation | sunshine | snow_depth | max_wind_gust |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2024-04-06 | 7.6 | 25.0 | 16.1 | 0.0 | 11.1 | 0.0 | 40.0 |
| 1 | 2024-04-07 | 9.1 | 23.0 | 17.0 | 0.0 | 6.6 | 0.0 | 26.0 |
| 2 | 2024-04-08 | 8.9 | 27.1 | 18.1 | 0.0 | 8.2 | 0.0 | 22.0 |
| 3 | 2024-04-09 | 5.1 | 18.2 | 10.6 | 0.5 | 1.1 | 0.0 | 50.0 |
| 4 | 2024-04-10 | 1.6 | 12.2 | 6.8 | 0.0 | 3.7 | 0.0 | 21.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1339 | 2020-11-16 | 6.8 | 11.2 | 8.3 | 0.1 | 0.5 | 0.0 | 39.0 |
| 1340 | 2020-11-17 | 1.3 | 10.8 | 6.8 | 0.0 | 5.8 | 0.0 | 27.0 |
| 1341 | 2020-11-18 | -1.4 | 10.5 | 4.9 | 0.0 | 8.5 | 0.0 | 20.0 |
| 1342 | 2020-11-19 | 3.7 | 9.5 | 6.3 | 6.4 | 0.7 | 0.0 | 44.0 |


### Feature Engineering

In [91]:
from feature_engineering import merge_data, perform_feature_engineering
timeseries_df = merge_data(pollutant_df_resampled, weather_df_validated)
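
The merge joins a pollutant frame indexed by `date` with a weather frame that carries `date` as a regular column. A minimal sketch of such a join is shown below. The inner join and the ascending sort are assumptions, and the name `merge_on_date` is hypothetical; `merge_data` may be implemented differently.

```python
import pandas as pd


def merge_on_date(pollutants: pd.DataFrame, weather: pd.DataFrame) -> pd.DataFrame:
    """Join daily pollutant and weather rows that share a date."""
    # Move the date index into a column and drop the "pollutant" axis label.
    left = pollutants.reset_index()
    left.columns.name = None
    left["date"] = pd.to_datetime(left["date"])
    right = weather.copy()
    right["date"] = pd.to_datetime(right["date"])
    # validate= raises if either side holds more than one row per date.
    merged = left.merge(right, on="date", how="inner", validate="one_to_one")
    return merged.sort_values("date").reset_index(drop=True)
```

With an inner join, days that are missing from either source are dropped, so the merged frame can be shorter than both inputs.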

In [93]:
# Perform feature engineering
timeseries_df = perform_feature_engineering(timeseries_df)

In [96]:
timeseries_df

|  | date | no2 | o3 | pm10 | pm2.5 | min_temp | max_temp | mean_temp | precipitation | sunshine | ... | temp_squared | windspeed_squared | sunshine_squared | temp_cubed | windspeed_cubed | sunshine_cubed | temp_precipitation_ratio | windspeed_temp_ratio | day_of_week_month | month_year |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2020-11-13 | 21.375000 | 13.958333 | 21.083333 | 12.791667 | 3.3 | 11.9 | 7.8 | 0.0 | 1.4 | ... | 60.84 | 484.0 | 1.96 | 474.552 | 10648.0 | 2.744 | 7.800000 | 2.500000 | 44 | 22220 |
| 1 | 2020-11-14 | 25.083333 | 17.791667 | 23.666667 | 13.958333 | 1.9 | 13.3 | 8.4 | 0.0 | 8.7 | ... | 70.56 | 576.0 | 75.69 | 592.704 | 13824.0 | 658.503 | 8.400000 | 2.553191 | 55 | 22220 |
| 2 | 2020-11-15 | 22.958333 | 15.416667 | 22.291667 | 13.791667 | 1.5 | 13.5 | 7.8 | 2.0 | 7.9 | ... | 60.84 | 1369.0 | 62.41 | 474.552 | 50653.0 | 493.039 | 2.600000 | 4.204545 | 66 | 22220 |
| 3 | 2020-11-16 | 16.583333 | 36.125000 | 7.875000 | 3.000000 | 6.8 | 11.2 | 8.3 | 0.1 | 0.5 | ... | 68.89 | 1521.0 | 0.25 | 571.787 | 59319.0 | 0.125 | 7.545455 | 4.193548 | 0 | 22220 |
| 4 | 2020-11-17 | 23.125000 | 26.333333 | 13.458333 | 6.041667 | 1.3 | 10.8 | 6.8 | 0.0 | 5.8 | ... | 46.24 | 729.0 | 33.64 | 314.432 | 19683.0 | 195.112 | 6.800000 | 3.461538 | 11 | 22220 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1246 | 2024-04-12 | 22.542333 | 58.678333 | 18.327500 | 11.380000 | 1.3 | 20.6 | 11.3 | 0.0 | 12.7 | ... | 127.69 | 400.0 | 161.29 | 1442.897 | 8000.0 | 2048.383 | 11.300000 | 1.626016 | 16 | 8096 |
| 1247 | 2024-04-13 | 21.202750 | 60.049167 | 18.704167 | 10.898333 | 4.2 | 23.6 | 14.7 | 0.0 | 12.1 | ... | 216.09 | 676.0 | 146.41 | 3176.523 | 17576.0 | 1771.561 | 14.700000 | 1.656051 | 20 | 8096 |
| 1248 | 2024-04-14 | 16.921667 | 59.979167 | 17.485833 | 7.860833 | 8.3 | 24.9 | 16.2 | 0.4 | 10.9 | ... | 262.44 | 1600.0 | 118.81 | 4251.528 | 64000.0 | 1295.029 | 11.571429 | 2.325581 | 0 | 8096 |
| 1249 | 2024-04-15 | 16.585167 | 54.872500 | 11.362500 | 5.818333 | 2.2 | 16.2 | 11.7 | 2.6 | 3.0 | ... | 136.89 | 3844.0 | 9.00 | 1601.613 | 238328.0 | 27.000 | 3.250000 | 4.881890 | 0 | 8096 |
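
The visible columns can be reproduced from the rows shown above. For 2020-11-16, for example, `mean_temp` is 8.3 and `temp_squared` is 68.89, `temp_precipitation_ratio` equals 8.3 / (0.1 + 1), and `windspeed_temp_ratio` equals 39.0 / (8.3 + 1). The sketch below encodes those inferred formulas. They are read off the output rather than taken from `perform_feature_engineering`, and using `max_wind_gust` as the wind speed source is an assumption that happens to match the rows where both values are visible.

```python
import pandas as pd


def add_derived_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add polynomial, ratio, and date-interaction columns (inferred formulas)."""
    out = df.copy()
    date = pd.to_datetime(out["date"])
    # Feature prefix -> assumed source column.
    sources = {"temp": "mean_temp", "windspeed": "max_wind_gust", "sunshine": "sunshine"}
    for name, column in sources.items():
        out[f"{name}_squared"] = out[column] ** 2
        out[f"{name}_cubed"] = out[column] ** 3
    # Adding 1 to the denominator avoids division by zero on dry days.
    out["temp_precipitation_ratio"] = out["mean_temp"] / (out["precipitation"] + 1)
    out["windspeed_temp_ratio"] = out["max_wind_gust"] / (out["mean_temp"] + 1)
    # Interaction terms: Monday is 0, so every Monday yields 0 here.
    out["day_of_week_month"] = date.dt.dayofweek * date.dt.month
    out["month_year"] = date.dt.month * date.dt.year
    return out
```

Two properties of these formulas are worth keeping in mind when modeling: the `+ 1` offset in `windspeed_temp_ratio` still divides by zero when `mean_temp` is exactly -1, and `day_of_week_month` cannot distinguish Mondays in different months.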


We have built a feature pipeline that covers every step from data extraction to feature engineering. It first extracts raw weather and pollutant data from their sources, then cleans and validates the data. Next, it resamples the pollutant data from hourly to daily frequency and merges the pollutant and weather datasets on their common dates. Finally, it derives date-related features such as day of the week, month, year, and season from the 'date' column, along with the polynomial, ratio, and interaction features visible in the output above (for example `temp_squared`, `windspeed_temp_ratio`, and `month_year`). The resulting dataset is ready for the subsequent analysis and modeling tasks.
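
Because every stage takes the output of the previous one, the stages can be chained behind a single entry point. The helper below is a small sketch of that idea. `run_steps` is a hypothetical name and not part of the project's modules.

```python
from functools import reduce
from typing import Any, Callable, Iterable


def run_steps(data: Any, steps: Iterable[Callable[[Any], Any]]) -> Any:
    """Feed `data` through each step in order and return the final result."""
    return reduce(lambda current, step: step(current), steps, data)
```

With the functions imported in this notebook, the pollutant branch would read `run_steps(fetch_data(POLLUTANT_URL), [validate_data, clean_data, resample_data])`.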