Data acquisition
Subtask:
Identify and acquire relevant historical data for the Ormanjhi, Ranchi area. This should include solar radiation/sunlight intensity, temperature, humidity, wind speed, and any other weather parameters that might influence solar power generation. Crucially, acquire data specifically for the year 2024, focusing on data points around sunrise hours. If possible, obtain historical solar power generation data from a solar plant in the area, along with details on the solar panel efficiency, number of panels, and total solar capacity of the plant.

In [26]:
# Researching data sources for Ormanjhi, Ranchi (2024, sunrise focus)

print("Researching potential data sources for historical weather and solar power generation data in Ormanjhi, Ranchi (2024).")

print("\nPotential Weather Data Sources:")
print("- Meteorological departments in India (e.g., India Meteorological Department - IMD)")
print("- Commercial weather data providers (e.g., AccuWeather, Weatherbit, OpenWeatherMap - often require API access or paid subscriptions for historical data)")
print("- Research institutions or universities that may have collected localized weather data.")
print("- Satellite data archives (e.g., NASA POWER project - provides solar radiation data, but may have lower temporal or spatial resolution)")
print("\nFocusing on 2024 data, especially around sunrise hours (e.g., hourly data from 4 AM to 8 AM local time), would be ideal.")
print("Key weather parameters to seek: Sunlight intensity (Global Horizontal Irradiance - GHI or Direct Normal Irradiance - DNI), Temperature, Humidity, Wind Speed, Cloud Cover.")


print("\nPotential Solar Power Generation Data Sources:")
print("- Local solar plant operators or owners in the Ormanjhi area.")
print("- State electricity boards or power grid operators (may have aggregated generation data).")
print("- Research projects or academic studies focused on solar energy in the region.")
print("- Publicly available energy data portals (less likely for specific plant data, more likely for aggregated regional data).")
print("\nAcquiring data for a specific solar plant is challenging due to data privacy and accessibility.")
print("If possible, obtaining data on solar panel efficiency, number of panels, and total plant capacity is crucial for a more accurate model.")

print("\nChallenges and Limitations:")
print("- Granularity of data: Hourly or sub-hourly data is preferred for time series forecasting, especially around sunrise.")
print("- Data accessibility: Historical data, particularly from specific solar plants, is often not publicly available.")
print("- Data format and quality: Acquired data may require significant cleaning and preprocessing.")
print("- Geographic specificity: Ensuring the weather data is highly relevant to the Ormanjhi area is important.")

print("\nGiven the limitations of this environment, direct acquisition of specific 2024 data for Ormanjhi is not possible.")
print("For the subsequent steps, a representative dataset (either simulated or a publicly available proxy) covering similar parameters and a relevant time period would be necessary to demonstrate the modeling process.")

Researching potential data sources for historical weather and solar power generation data in Ormanjhi, Ranchi (2024).

Potential Weather Data Sources:
- Meteorological departments in India (e.g., India Meteorological Department - IMD)
- Commercial weather data providers (e.g., AccuWeather, Weatherbit, OpenWeatherMap - often require API access or paid subscriptions for historical data)
- Research institutions or universities that may have collected localized weather data.
- Satellite data archives (e.g., NASA POWER project - provides solar radiation data, but may have lower temporal or spatial resolution)

Focusing on 2024 data, especially around sunrise hours (e.g., hourly data from 4 AM to 8 AM local time), would be ideal.
Key weather parameters to seek: Sunlight intensity (Global Horizontal Irradiance - GHI or Direct Normal Irradiance - DNI), Temperature, Humidity, Wind Speed, Cloud Cover.

Potential Solar Power Generation Data Sources:
- Local solar plant operators or owners in the 
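
Of the sources listed above, NASA POWER is the one most readily queried programmatically. The sketch below only builds a request URL for hourly irradiance, temperature, humidity, and wind speed. The endpoint path, the query keys, the parameter codes, and the coordinates are all assumptions (the coordinates are rough placeholders for the Ormanjhi area); check them against the current NASA POWER API documentation and the actual site location before relying on them.

```python
from urllib.parse import urlencode

# Assumed base URL of the NASA POWER hourly point endpoint (verify before use).
POWER_HOURLY_URL = "https://power.larc.nasa.gov/api/temporal/hourly/point"


def build_power_request(latitude, longitude, start, end,
                        parameters=("ALLSKY_SFC_SW_DWN", "T2M", "RH2M", "WS10M")):
    """Build a request URL for hourly GHI, temperature, humidity, and wind speed.

    The parameter codes are assumed to be: all-sky surface shortwave downward
    irradiance, 2 m temperature, 2 m relative humidity, and 10 m wind speed.
    """
    query = {
        "parameters": ",".join(parameters),
        "community": "RE",   # assumed code for the renewable-energy community
        "latitude": latitude,
        "longitude": longitude,
        "start": start,      # YYYYMMDD
        "end": end,          # YYYYMMDD
        "format": "JSON",
    }
    # Keep commas unescaped so the parameter list stays readable.
    return f"{POWER_HOURLY_URL}?{urlencode(query, safe=',')}"


# Placeholder coordinates, roughly in the Ormanjhi area (hypothetical values).
url = build_power_request(23.47, 85.47, "20240101", "20241231")
print(url)
```

The URL can then be passed to any HTTP client; the response would still need filtering to the sunrise window and checking for fill values.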

In [30]:
import pandas as pd
import numpy as np

# Acknowledge previous failure
print("Acknowledging the previous failure in acquiring specific real-world data for Ormanjhi, Ranchi.")
print("Due to environmental limitations and data accessibility challenges, it is not possible to obtain the precise historical data as requested.")
print("Therefore, a simulated dataset will be used to proceed with the data science workflow and demonstrate the subsequent steps.")

# Describe the structure of the simulated dataset
print("\nCreating a simulated dataset with the following features:")
print("- timestamp (hourly data for a year in 2024, with a focus on sunrise hours)")
print("- sunlight_intensity (simulated values based on time of day)")
print("- temperature (simulated values with daily and seasonal variations)")
print("- humidity (simulated values with daily and seasonal variations)")
print("- wind_speed (simulated values with random fluctuations)")
print("- solar_panel_efficiency (constant plausible value)")
print("- number_of_panels (constant plausible value)")
print("- solar_capacity_kw (calculated based on efficiency and number of panels)")
print("- power_generation (target variable, simulated based on weather and solar plant parameters)")

# Create a pandas DataFrame containing this simulated data
# Generate hourly timestamps for a year in 2024
start_date = '2024-01-01 00:00:00'
end_date = '2024-12-31 23:00:00'
timestamps = pd.date_range(start=start_date, end=end_date, freq='h')

# Filter for sunrise hours (e.g., 4 AM to 9 AM)
sunrise_hours_timestamps = timestamps[(timestamps.hour >= 4) & (timestamps.hour <= 9)]

# Create a DataFrame with the filtered timestamps
df = pd.DataFrame({'timestamp': sunrise_hours_timestamps})

# Simulate realistic-ish data for weather parameters and power generation
np.random.seed(42) # for reproducibility

# Simulate sunlight intensity over the 04:00-09:00 window with a half-sine ramp:
# 0 W/m^2 at 04:00, a peak of 800 W/m^2 at 07:00, and half the peak by 09:00, plus Gaussian noise.
# This is a stylized sunrise ramp, not a physical clear-sky model.
df['sunlight_intensity'] = np.sin((df['timestamp'].dt.hour - 4) / 6 * np.pi) * 800 + np.random.randn(len(df)) * 50
# Rows outside 04:00-09:00 were already filtered out above, so only negative noise needs clipping
df['sunlight_intensity'] = df['sunlight_intensity'].clip(lower=0) # Ensure non-negative

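# Simulate temperature: 20 °C baseline with a daily term (±5 °C), a seasonal term (±10 °C), and noise
# Note: the daily sine term peaks at 06:00 and the seasonal term near day 91,
# so this is a stylized pattern rather than a calibrated climate model.
df['temperature'] = 20 + 5 * np.sin(df['timestamp'].dt.hour / 24 * 2 * np.pi) + 10 * np.sin(df['timestamp'].dt.dayofyear / 365 * 2 * np.pi) + np.random.randn(len(df)) * 2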

# Simulate humidity: generally decreases with temperature, fluctuates throughout the day and year
df['humidity'] = 70 - 10 * np.sin(df['timestamp'].dt.hour / 24 * 2 * np.pi) - 15 * np.sin(df['timestamp'].dt.dayofyear / 365 * 2 * np.pi) + np.random.randn(len(df)) * 5
df['humidity'] = df['humidity'].clip(lower=0, upper=100) # Ensure realistic range

# Simulate wind speed: random fluctuations with some daily pattern
df['wind_speed'] = 5 + 2 * np.sin(df['timestamp'].dt.hour / 24 * 2 * np.pi) + np.random.randn(len(df)) * 2
df['wind_speed'] = df['wind_speed'].clip(lower=0) # Ensure non-negative

# Simulate solar plant parameters
solar_panel_efficiency = 0.18  # 18% efficiency
number_of_panels = 1000 # Hypothetical number of panels
panel_area_sq_m = 1.6 * 1 # Assuming 1.6m x 1m per panel
total_panel_area_sq_m = number_of_panels * panel_area_sq_m
peak_irradiance = 1000 # W/m^2 (standard test conditions)
solar_capacity_kw = (total_panel_area_sq_m * solar_panel_efficiency * peak_irradiance) / 1000 # Convert W to kW

df['solar_panel_efficiency'] = solar_panel_efficiency
df['number_of_panels'] = number_of_panels
df['solar_capacity_kw'] = solar_capacity_kw

# Simulate power generation from sunlight intensity and plant capacity, with some noise
# Power scales linearly with irradiance relative to peak irradiance; the noise has a
# standard deviation of 5% of capacity. Only the lower bound is clipped, so no upper cap
# at solar_capacity_kw is applied here.
noise = np.random.randn(len(df)) * (solar_capacity_kw * 0.05)
df['power_generation'] = (df['sunlight_intensity'] / peak_irradiance) * solar_capacity_kw + noise
df['power_generation'] = df['power_generation'].clip(lower=0) # Ensure non-negative power generation


# The timestamp column is already datetime64 (it comes from pd.date_range), so set it as the index directly
df = df.set_index('timestamp')


# Display the first few rows of the created DataFrame
print("\nFirst 5 rows of the simulated DataFrame:")
print(df.head())

print("\nDataFrame shape:", df.shape)
print("\nDataFrame columns and data types:")
df.info()  # info() prints its summary itself and returns None, so it is not wrapped in print()

Acknowledging the previous failure in acquiring specific real-world data for Ormanjhi, Ranchi.
Due to environmental limitations and data accessibility challenges, it is not possible to obtain the precise historical data as requested.
Therefore, a simulated dataset will be used to proceed with the data science workflow and demonstrate the subsequent steps.

Creating a simulated dataset with the following features:
- timestamp (hourly data for a year in 2024, with a focus on sunrise hours)
- sunlight_intensity (simulated values based on time of day)
- temperature (simulated values with daily and seasonal variations)
- humidity (simulated values with daily and seasonal variations)
- wind_speed (simulated values with random fluctuations)
- solar_panel_efficiency (constant plausible value)
- number_of_panels (constant plausible value)
- solar_capacity_kw (calculated based on efficiency and number of panels)
- power_generation (target variable, simulated based on weather and solar plant para
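
Two properties of the simulated dataset can be checked by hand, and the sketch below reproduces them independently of the cell above. It assumes the same constants as the simulation (1,000 panels of 1.6 m², 18% efficiency, 1,000 W/m² peak irradiance) and the same 04:00 to 09:00 filter.

```python
import pandas as pd

# Rebuild the hourly index for 2024 and keep the 04:00-09:00 window
timestamps = pd.date_range("2024-01-01 00:00:00", "2024-12-31 23:00:00", freq="h")
sunrise = timestamps[(timestamps.hour >= 4) & (timestamps.hour <= 9)]

# 2024 is a leap year: 366 days x 6 hourly slots per day
expected_rows = 366 * 6
print("rows:", len(sunrise), "expected:", expected_rows)

# Nameplate capacity: total area (m^2) x efficiency x peak irradiance (W/m^2), in kW
number_of_panels = 1000
panel_area_sq_m = 1.6
efficiency = 0.18
peak_irradiance = 1000
capacity_kw = number_of_panels * panel_area_sq_m * efficiency * peak_irradiance / 1000
print("capacity (kW):", round(capacity_kw, 3))
```

The row count should match the first dimension of `df.shape`, and the capacity should match the constant `solar_capacity_kw` column.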

In [14]:
# 1. Identify and handle missing values
print("Checking for missing values:")
missing_values = df.isnull().sum()
print(missing_values)

if missing_values.sum() == 0:
    print("\nNo missing values found in the DataFrame.")
else:
    print("\nMissing values found. Imputation strategy (e.g., mean, median, ffill, bfill) would be applied here if necessary.")
    # Example of filling with mean if there were missing values:
    # df.fillna(df.mean(), inplace=True)
    # print("\nDataFrame after handling missing values (example with mean imputation):")
    # display(df.head())


Checking for missing values:
sunlight_intensity        0
temperature               0
humidity                  0
wind_speed                0
solar_panel_efficiency    0
number_of_panels          0
solar_capacity_kw         0
power_generation          0
dtype: int64

No missing values found in the DataFrame.
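
The simulated data has no gaps, but real sensor feeds usually do. As a minimal sketch of one of the strategies named in the cell above, the example below fills short gaps in a toy temperature series by time-weighted interpolation. The toy values and the `limit=2` choice are illustrative assumptions, not part of the original workflow.

```python
import numpy as np
import pandas as pd

# Toy hourly series with gaps (hypothetical values)
idx = pd.date_range("2024-01-01 04:00", periods=6, freq="h")
demo = pd.DataFrame(
    {"temperature": [18.0, np.nan, 20.0, np.nan, np.nan, 26.0]},
    index=idx,
)

# Interpolate linearly in time, filling at most 2 consecutive missing values
filled = demo.interpolate(method="time", limit=2)
print(filled)
```

Because the dataset only keeps 04:00 to 09:00, consecutive rows can be 19 hours apart across a day boundary; interpolating within each day separately would avoid bridging that overnight gap.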


In [24]:
# 2. (Correction) Display DataFrame after addressing outliers using print
print("\nDataFrame after addressing outliers:")
print(df.head().to_markdown(index=True, numalign="left", stralign="left"))

# 3. Engineer additional time-based features
print("\nEngineering additional time-based features:")
df['hour'] = df.index.hour
df['day_of_week'] = df.index.dayofweek
df['day_of_year'] = df.index.dayofyear
df['month'] = df.index.month
df['quarter'] = df.index.quarter

print("\nDataFrame with time-based features:")
print(df.head().to_markdown(index=True, numalign="left", stralign="left"))

# 4. Ensure solar plant features are present
print("\nEnsuring solar plant features are present:")
solar_features = ['solar_panel_efficiency', 'number_of_panels', 'solar_capacity_kw']
if all(col in df.columns for col in solar_features):
    print("Solar plant features are present in the DataFrame.")
    print("Solar plant features (first 5 rows):")
    print(df[solar_features].head().to_markdown(index=True, numalign="left", stralign="left"))
else:
    print(f"Error: One or more solar plant features are missing. Expected columns: {solar_features}")



DataFrame after addressing outliers:
| timestamp           | sunlight_intensity   | temperature   | humidity   | wind_speed   | solar_panel_efficiency   | number_of_panels   | solar_capacity_kw   | power_generation   |
|:--------------------|:---------------------|:--------------|:-----------|:-------------|:-------------------------|:-------------------|:--------------------|:-------------------|
| 2024-01-01 04:00:00 | 24.8357              | 25.354        | 62.7032    | 4.97203      | 0.18                     | 1000               | 288                 | 0                  |
| 2024-01-01 05:00:00 | 393.087              | 25.0401       | 57.4189    | 7.47021      | 0.18                     | 1000               | 288                 | 116.453            |
| 2024-01-01 06:00:00 | 725.205              | 23.8892       | 63.9401    | 6.56843      | 0.18                     | 1000               | 288                 | 217.922            |
| 2024-01-01 07:00:00 | 876.151              | 25.97
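
Integer time features such as `hour` and `day_of_year` place the end of a cycle far from its start (day 366 versus day 1). A common alternative, not used in the cell above, is to encode each cyclic feature as a sine/cosine pair. The sketch below shows this on a toy index; the column names `hour_sin`, `hour_cos`, `doy_sin`, and `doy_cos` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy index covering one sunrise window
idx = pd.date_range("2024-01-01 04:00", periods=6, freq="h")
demo = pd.DataFrame(index=idx)
demo["hour"] = demo.index.hour
demo["day_of_year"] = demo.index.dayofyear

# Map each cyclic feature onto the unit circle
demo["hour_sin"] = np.sin(2 * np.pi * demo["hour"] / 24)
demo["hour_cos"] = np.cos(2 * np.pi * demo["hour"] / 24)

days_in_year = 366  # 2024 is a leap year
demo["doy_sin"] = np.sin(2 * np.pi * demo["day_of_year"] / days_in_year)
demo["doy_cos"] = np.cos(2 * np.pi * demo["day_of_year"] / days_in_year)

print(demo.round(3))
```

Tree-based models often cope well with the raw integers, so this encoding matters mainly for linear models and neural networks.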

In [33]:
from sklearn.preprocessing import StandardScaler

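# 5. Scale the relevant numerical features
# Exclude 'solar_panel_efficiency', 'number_of_panels', and 'solar_capacity_kw' as they are constant
# Include the target variable 'power_generation' for scaling
# Caution: the scaler is fit on the full DataFrame here. When a model is evaluated on a held-out
# split, fit the scaler on the training rows only to avoid leaking test-set statistics.
# Scaling the target with the same scaler also means predictions need an inverse transform to get back to kW.
numerical_features_to_scale = ['sunlight_intensity', 'temperature', 'humidity', 'wind_speed', 'power_generation']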

print("\nScaling numerical features using StandardScaler:")

scaler = StandardScaler()
df[numerical_features_to_scale] = scaler.fit_transform(df[numerical_features_to_scale])

print("\nDataFrame after scaling numerical features:")
print(df.head().to_markdown(index=True, numalign="left", stralign="left"))


Scaling numerical features using StandardScaler:

DataFrame after scaling numerical features:
| timestamp           | sunlight_intensity   | temperature   | humidity   | wind_speed   | solar_panel_efficiency   | number_of_panels   | solar_capacity_kw   | power_generation   |
|:--------------------|:---------------------|:--------------|:-----------|:-------------|:-------------------------|:-------------------|:--------------------|:-------------------|
| 2024-01-01 04:00:00 | -1.78614             | 0.123686      | 0.153585   | -0.876942    | 0.18                     | 1000               | 288                 | -1.88913           |
| 2024-01-01 05:00:00 | -0.407639            | 0.0812056     | -0.298452  | 0.313334     | 0.18                     | 1000               | 288                 | -0.370777          |
| 2024-01-01 06:00:00 | 0.835602             | -0.074508     | 0.259391   | -0.116328    | 0.18                     | 1000               | 288                 | 0.952213        
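
The cell above fits one scaler on every row and on the target together. A hedged alternative, sketched below on synthetic data, splits chronologically first, fits the feature scaler on the training rows only, and keeps a separate scaler for the target so predictions can be mapped back to kW. The toy data, the 80/20 split, and the names `x_scaler` and `y_scaler` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the simulated DataFrame (hypothetical values)
rng = np.random.default_rng(0)
idx = pd.date_range("2024-01-01 04:00", periods=100, freq="h")
demo = pd.DataFrame(
    {
        "sunlight_intensity": rng.uniform(0, 900, 100),
        "temperature": rng.normal(22, 4, 100),
        "power_generation": rng.uniform(0, 288, 100),
    },
    index=idx,
)

feature_cols = ["sunlight_intensity", "temperature"]
target_col = "power_generation"

# Chronological split: the first 80% of rows train, the rest test
split = int(len(demo) * 0.8)
train, test = demo.iloc[:split], demo.iloc[split:]

# Fit scalers on the training rows only
x_scaler = StandardScaler().fit(train[feature_cols])
y_scaler = StandardScaler().fit(train[[target_col]])

X_train = x_scaler.transform(train[feature_cols])
X_test = x_scaler.transform(test[feature_cols])
y_train = y_scaler.transform(train[[target_col]])

# Map scaled target values back to kW
y_back = y_scaler.inverse_transform(y_train)
print(X_train.shape, X_test.shape, y_back[:3].ravel())
```

With a separate target scaler, model outputs can be passed through `y_scaler.inverse_transform` to report errors in kW rather than in standard deviations.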