<div style="background-color: #e6e6fa; padding: 10px;">
    <h1>Introduction</h1>
</div>
<div style="display: flex; justify-content: space-between; align-items: flex-start;">
    <div style="padding: 10px; width: 70%;"> <!-- Adjust width to give more space to the image -->
        <p>This notebook covers the <a href="https://www.kaggle.com/competitions/predict-energy-behavior-of-prosumers">Enefit competition</a>, which has just concluded. The features described here, among others, secured us <strong style="color: green; text-decoration: underline;">39th position</strong> on the Public Leaderboard.
 They may interest you if you took part in this competition or want to explore feature engineering strategies for Kaggle competitions. I'll share some key features that either boosted my CV score or reduced the total feature count in my model without significantly affecting the score. I plan to discuss publishing our final submission with my team members, so this notebook focuses solely on the features I personally developed. My goal is to explain them clearly. This notebook isn't submitted for scoring, and the code snippets in it are partial.</p>
        <p>The notebook has four sections:</p>
        <ul>
            <li><strong>Sun Angle Features:</strong> Using the ephem library to add a sun elevation feature.</li>
            <li><strong>Holiday Features:</strong> Using a categorical holiday feature instead of a binary one.</li>
            <li><strong>Summer Time:</strong> Using Daylight Saving Time as a feature.</li>
            <li><strong>Streamlining Forecast and Historical Weather Features:</strong> Using weighted averaging to decrease the number of features.</li>
        </ul>
        <h3 style="color: green;"><strong><em>I hope you will enjoy my notebook. If you find it useful, please consider upvoting.</em></strong></h3>
    </div>
    <div style="padding: 10px; width: 30%; text-align: right;"> <!-- Directly set the width to increase the image size -->
        <img src="https://github.com/ocarhaci/images/blob/main/feature_engineering.jpg?raw=true" alt="feature engineering" style="width: 100%; height: auto;">
    </div>
</div>


<div style="background-color: #e6e6fa; padding: 10px;">
    <h1>Sun Angle Features</h1>
</div>

<div style="display: flex; justify-content: space-between; align-items: flex-start;">
    <div style="padding: 10px; width: 70%;"> <!-- Maintain the width for text -->
        <p>This feature is my favorite innovation in this competition, mainly because it goes beyond simple feature engineering by introducing new data. It leverages the ephem library in Python, which is designed for precise astronomical calculations. This library allows us to determine the positions of celestial bodies, including stars, planets, and the moon, from any point on Earth at any time. So, how does this relate to our competition?

The production of solar energy is heavily dependent on sunlight. Using the ephem library, we can calculate the sun's elevation (its altitude above the horizon) for each county at every hour from the latitude, longitude, and local time, all of which are available in our data. The local time only needs to be converted to UTC first, because that is what ephem expects.

While the library also offers other data points like solar distance and azimuth, incorporating these into my model didn't improve my CV score. Nonetheless, I believe there's potential for further exploration and improvement using this approach.</p>
    </div>
    <div style="padding: 10px; width: 30%; text-align: right;"> <!-- Set the width for image -->
        <img src="https://github.com/ocarhaci/images/blob/main/sun.jpg?raw=true" alt="sun feature" style="width: 100%; height: auto;">
    </div>
</div>


In [None]:
import math
from datetime import timedelta

import ephem
import polars as pl

# county_coords is assumed to be defined elsewhere in the notebook as
# {county: {'latitude': ..., 'longitude': ...}}


def compute_sun_elevation(latitude, longitude, date_time):
    observer = ephem.Observer()
    observer.lat = str(latitude)  # Latitude in degrees as a string
    observer.lon = str(longitude) # Longitude in degrees as a string
    observer.date = date_time     # Date and time in UTC

    sun = ephem.Sun(observer)
    return math.degrees(sun.alt)  # sun.alt is in radians


def convert_local_time_to_utc(local_time, time_difference):
    # time_difference is the local UTC offset in hours (2 for EET, 3 for EEST)
    return local_time - timedelta(hours=time_difference)


def compute_sun_elevation_for_county(county, datetime_val, time_difference):
    if county in county_coords:
        coords = county_coords[county]
        latitude = coords['latitude']
        longitude = coords['longitude']

        # Convert local datetime to UTC
        utc_time = convert_local_time_to_utc(datetime_val, time_difference)

        # Compute sun elevation
        return compute_sun_elevation(latitude, longitude, utc_time)
    else:
        return None

class FeaturesGenerator:
        
    def _sun_features(self, df_features):
       
        county_series = df_features['county']
        datetime_series = df_features['datetime']

        # Compute sun elevation for each time difference
        # (the loop variable is named dt so it does not shadow the datetime module)
        sun_elevation_GMT2 = pl.Series([compute_sun_elevation_for_county(county, dt, 2) for county, dt in zip(county_series, datetime_series)])
        sun_elevation_GMT3 = pl.Series([compute_sun_elevation_for_county(county, dt, 3) for county, dt in zip(county_series, datetime_series)])

        # Add the computed series as new columns to df_features
        df_features = df_features.with_columns([
            sun_elevation_GMT2.alias('sun_elevation_GMT2'),
            sun_elevation_GMT3.alias('sun_elevation_GMT3')
        ])
        return df_features
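
<p>As a quick sanity check on the ephem output, the elevation can also be approximated with the textbook declination and hour-angle formulas. The sketch below is not what ephem does internally and was not part of our pipeline. It ignores the equation of time and atmospheric refraction, so it can be off by a few degrees. The function name <code>approx_sun_elevation</code> and the Tallinn coordinates are illustrative assumptions.</p>

```python
import math
from datetime import datetime


def approx_sun_elevation(latitude, longitude, utc_time):
    """Approximate solar elevation in degrees (no equation of time, no refraction)."""
    day_of_year = utc_time.timetuple().tm_yday
    # Approximate solar declination in degrees
    declination = -23.44 * math.cos(math.radians(360.0 / 365.0 * (day_of_year + 10)))

    # Approximate local solar time: 15 degrees of longitude correspond to one hour
    utc_hours = utc_time.hour + utc_time.minute / 60.0 + utc_time.second / 3600.0
    solar_hours = utc_hours + longitude / 15.0
    hour_angle = 15.0 * (solar_hours - 12.0)  # degrees, 0 at solar noon

    lat, dec, ha = (math.radians(x) for x in (latitude, declination, hour_angle))
    sin_elevation = (math.sin(lat) * math.sin(dec)
                     + math.cos(lat) * math.cos(dec) * math.cos(ha))
    return math.degrees(math.asin(sin_elevation))


# Hypothetical check: Tallinn (about 59.44 N, 24.75 E) near solar noon (10:21 UTC)
summer_noon = approx_sun_elevation(59.44, 24.75, datetime(2023, 6, 21, 10, 21))
winter_noon = approx_sun_elevation(59.44, 24.75, datetime(2023, 12, 21, 10, 21))
print(round(summer_noon, 1), round(winter_noon, 1))
```

<p>The large gap between the midsummer and midwinter noon values at this latitude is a good reminder of why sun elevation carries information that the hour of day alone does not.</p>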

<div style="background-color: #e6e6fa; padding: 10px;">
    <h1>Holiday Features</h1>
</div>

<div style="display: flex; justify-content: space-between; align-items: flex-start;">
    <div style="padding: 10px; width: 70%;"> <!-- Maintain the width for text -->
        <p>While holiday features have been covered in previous public notebooks, typically as binary indicators, my analysis revealed a crucial insight: not all holidays are the same. Certain holidays encourage people to stay indoors, while others inspire outdoor activities or attending events. This variation means the nature of the holiday significantly impacts data patterns.

Acknowledging this, I classified holidays by their type, transforming the holiday indicator into a categorical feature rather than a simple binary one. This adjustment led to a notable improvement in Mean Absolute Error (MAE), particularly for data points on holiday dates.</p>
    </div>
    <div style="padding: 10px; width: 30%; text-align: right;"> <!-- Set the width for image container and align text (and inline elements) to the right -->
        <img src="https://github.com/ocarhaci/images/blob/main/christmas.jpg?raw=true" alt="christmas" style="width: 100%; height: auto; display: inline-block;"> <!-- Image at 50% width, inline-block will respect text-align -->
    </div>
</div>



In [None]:
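import holidays

# Estonian public holidays for the years covered by the data
estonian_holidays = holidays.country_holidays('EE', years=[2021, 2022, 2023, 2024, 2025, 2026])


def add_holidays(train, holiday_calendar):
    # train is assumed to be a pandas DataFrame with a 'date' column.
    # Map each date to its holiday name; all other days share one label.
    train['country_holiday'] = train['date'].map(holiday_calendar.get).fillna('Not Holiday')
    return train


train = add_holidays(train, estonian_holidays)
train['country_holiday'] = train['country_holiday'].astype('category')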

<div style="background-color: #e6e6fa; padding: 10px;">
    <h1>Summer Time</h1>
</div>


<div style="display: flex; justify-content: space-between; align-items: flex-start;">
    <div style="padding: 10px; width: 70%;"> <!-- Maintain the width for text -->
        <p>Estonia, like many European countries, observes Daylight Saving Time (DST), adjusting clocks to make better use of daylight in the evenings.

<strong>Standard Time (GMT+2):</strong> During winter, Estonia follows Eastern European Time (EET), which is GMT+2. This is their standard time.
            
<strong>Daylight Saving Time (GMT+3):</strong> In summer, Estonia switches to Eastern European Summer Time (EEST), GMT+3, moving clocks forward to extend daylight in the evening hours.
            
Incorporating DST as a binary feature is beneficial because many weather-related data points, which are crucial for historical and forecast analysis, depend on time. High-performing models often leverage time-delayed features, like temperature from 24 hours ago, and DST adjustments impact these calculations.

To streamline features, you can integrate the DST adjustment by combining the sun_elevation_GMT2 and sun_elevation_GMT3 features (explained in the Sun Angle Features section) into a single sun_elevation_real feature that uses whichever offset is in effect, effectively reducing three separate features to one. This simplifies the model without losing the impact of DST on time-dependent weather data.</p>
    </div>
    <div style="padding: 10px; width: 30%; text-align: right;"> <!-- Set the width for image container and align text (and inline elements) to the right -->
        <img src="https://github.com/ocarhaci/images/blob/main/summer_time_change.jpg?raw=true" alt="time change" style="width: 100%; height: auto; display: inline-block;"> <!-- Image at 50% width, inline-block will respect text-align -->
    </div>
</div>

In [None]:
import numpy as np
import pandas as pd


def is_summer_time_vectorized(df, dst_start, dst_end):
    # df is a pandas DataFrame (a polars frame can be converted with .to_pandas()).
    # dst_start / dst_end: dicts mapping year -> local datetime of the DST switch.
    # Vectorized approach to determine if each row falls within DST
    current_datetime = pd.to_datetime(df[['year', 'month', 'day', 'hour']])
    dst_start_datetime = df['year'].map(dst_start)
    dst_end_datetime = df['year'].map(dst_end)

    df['is_summer_time'] = ((current_datetime >= dst_start_datetime) &
                            (current_datetime < dst_end_datetime)).astype(int)

    # Keep the elevation computed with the UTC offset actually in effect
    df['sun_elevation_real'] = np.where(df['is_summer_time'] == 1,
                                        df['sun_elevation_GMT3'],
                                        df['sun_elevation_GMT2'])

    # Drop the offset-specific columns now that they are merged
    df.drop(['sun_elevation_GMT3', 'sun_elevation_GMT2'], axis=1, inplace=True)
    return df
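
<p>The function above expects <code>dst_start</code> and <code>dst_end</code> as dictionaries keyed by year, which this notebook does not show. Here is a minimal sketch of one way to build them from the EU rule (DST runs from the last Sunday of March to the last Sunday of October, switching at 01:00 UTC). It assumes the timestamps in the data are Estonian local wall-clock times; the helper names <code>last_sunday</code> and <code>build_dst_boundaries</code> are my own and hypothetical.</p>

```python
import calendar
from datetime import datetime, timedelta


def last_sunday(year, month):
    """Return midnight on the last Sunday of the given month."""
    last_day = calendar.monthrange(year, month)[1]
    candidate = datetime(year, month, last_day)
    # weekday(): Monday is 0 and Sunday is 6, so this steps back to Sunday
    return candidate - timedelta(days=(candidate.weekday() + 1) % 7)


def build_dst_boundaries(years):
    """Map each year to the local (Estonian wall-clock) DST start and end."""
    dst_start, dst_end = {}, {}
    for year in years:
        # 01:00 UTC is 03:00 local under EET (UTC+2)
        dst_start[year] = last_sunday(year, 3) + timedelta(hours=3)
        # 01:00 UTC is 04:00 local under EEST (UTC+3)
        dst_end[year] = last_sunday(year, 10) + timedelta(hours=4)
    return dst_start, dst_end


dst_start, dst_end = build_dst_boundaries(range(2021, 2025))
print(dst_start[2023], dst_end[2023])
```

<p>The two dictionaries can be passed directly as the <code>dst_start</code> and <code>dst_end</code> arguments. Note that the repeated local hour on the October switch day cannot be told apart in naive timestamps, so one of those two hours will always be labelled with the wrong offset.</p>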

<div style="background-color: #e6e6fa; padding: 10px;">
    <h1>Streamlining Forecast and Historical Weather Features</h1>
</div>


<div style="display: flex; justify-content: space-between; align-items: flex-start;">
    <div style="padding: 10px; width: 70%;"> <!-- Maintain the width for text -->
        <p>I can't share specific code snippets here because the logic is spread across different parts of our work, but the underlying idea is simple. We used weather data at several lags (24h, 48h, 72h, and so on), which proved particularly impactful for energy consumption predictions. The temperature inside a building follows the outside temperature with some delay because of insulation, and it also tends to average out over time. For instance, a single cold day in a generally warm week might not significantly alter the indoor temperature.

To keep the feature count from growing with every individual historical data point, I replaced them with a weighted average of the past eight days of weather data, using geometrically increasing weights. The weights were [1, 1.3, 1.7, 2.2, 2.9, 3.8, 4.9, 6.4], each roughly 1.3 times the previous one. This means the weather data from yesterday is weighted at 6.4, while data from eight days ago is weighted at 1.

Additionally, I streamlined the model by removing certain forecast and historical weather data that proved less impactful, such as those related to rain, snowfall, and radiation. This selective inclusion helped reduce the feature set without compromising the model's predictive capability.</p>
    </div>
    <div style="padding: 10px; width: 30%; text-align: right;"> <!-- Set the width for image container and align text (and inline elements) to the right -->
        <img src="https://github.com/ocarhaci/images/blob/main/time_series.jpg?raw=true" alt="time series" style="width: 100%; height: auto; display: inline-block;"> <!-- Image at 50% width, inline-block will respect text-align -->
    </div>
</div>
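
<p>To make the weighted average concrete, here is a minimal pure-Python sketch using the weights quoted above. It is an illustration under assumptions, not our competition code: the function name <code>weighted_history</code>, the oldest-first ordering, and the example temperatures are all hypothetical.</p>

```python
# Weights quoted in the text, ordered from eight days ago (oldest) to yesterday (newest)
WEIGHTS = [1, 1.3, 1.7, 2.2, 2.9, 3.8, 4.9, 6.4]


def weighted_history(daily_values, weights=WEIGHTS):
    """Collapse one value per past day (oldest first) into a single weighted average."""
    if len(daily_values) != len(weights):
        raise ValueError("expected one value per weight")
    total = sum(w * v for w, v in zip(weights, daily_values))
    return total / sum(weights)


# Hypothetical temperatures (deg C) at the same hour on each of the past eight days
warm_week_one_cold_day = [15, 15, 15, 2, 15, 15, 15, 15]
cold_snap_yesterday = [15, 15, 15, 15, 15, 15, 15, 2]
print(round(weighted_history(warm_week_one_cold_day), 2))
print(round(weighted_history(cold_snap_yesterday), 2))
```

<p>The same cold day pulls the average down much more when it happened yesterday than when it happened several days ago, which is the behavior the geometric weights are meant to capture while replacing eight lag features with one.</p>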

<h2 style="color: green;"><strong><em>Thanks for reading. I hope it was useful.</em></strong></h2>