# Power System Data Collection and Validation
This project aims to gather, clean, and validate time series datasets related to power and energy systems. The data will be centralized, documented, and used to support planning and simulation tasks within energy network research.

---

#### Instructions

1. **Run Step 1**  
   Creates the `DataFrame` containing all available data sources.

2. **Run Step 2**  
   Defines the `query_data_sources()` function used to filter datasets based on specific criteria.

3. **Edit and Run Step 3**  
   Calls the `query_data_sources()` function with your desired filters.

4. **Run Step 4**  
   Displays direct links to each matched dataset's subfolder.

5. **Explore Further**  
   A detailed Jupyter notebook is available in each source's subfolder.


In [6]:
## Step 1

# DataFrame summarizing all available power system data sources
import pandas as pd
pd.set_option('display.max_colwidth', None)

data_sources = [
    {
        "Source": "AgenceORE",
        "Description": "Aggregated half-hourly electricity consumption data from consumption points with power subscriptions below 36kVA.",
        "Number of Profiles": 130,
        "Profile Types": ["load", "consumption points", "energy consumption"],
        "Load": ["active", "aggregated", "residential"],
        "Renewable": [],
        "Environment": [],
        "Economy": [],
        "Processed": True,
        "Synthetic": False,
        "Horizon": "2020–2024",
        "Time Resolution": ["30min"],
        "Geographical": ["France"],
        "Folder": "AgenceORE/"
    },
    {
        "Source": "OPSD",
        "Description": "Open Power System Data - EU-wide TSO-provided time series",
        "Number of Profiles": 298,
        "Profile Types": ["load", "renewable", "capacity", "price"],
        "Load": ["active", "aggregated", "national", "historical"],
        "Renewable": ["solar", "wind"],
        "Environment": [],
        "Economy": ["price"],
        "Processed": True,
        "Synthetic": False,
        "Horizon": "2015-2020",
        "Time Resolution": ["15min", "30min", "60min"],
        "Geographical": ["EU", "United Kingdom", "Switzerland", "Norway", "Montenegro", "Serbia", "Ukraine"],
        "Folder": "OPSD_TimeSeries/"
    },
    {
        "Source": "SimBench",
        "Description": "Synthetic power system benchmark datasets for grid studies",
        "Number of Profiles": 614,
        "Profile Types": ["load", "renewable", "powerplant", "storage"],
        "Load": ["active", "reactive", "residential", "industry", "commercial"],
        "Renewable": ["solar", "wind", "biomass", "hydro"],
        "Environment": [],
        "Economy": [],
        "Processed": True,
        "Synthetic": True,
        "Horizon": "2016-2017",
        "Time Resolution": ["15min"],
        "Geographical": ["Germany"],
        "Folder": "SimBench/"
    }
]

df = pd.DataFrame(data_sources)
df

| | Source | Description | Number of Profiles | Profile Types | Load | Renewable | Environment | Economy | Processed | Synthetic | Horizon | Time Resolution | Geographical | Folder |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AgenceORE | Aggregated half-hourly electricity consumption data from consumption points with power subscriptions below 36kVA. | 130 | [load, consumption points, energy consumption] | [active, aggregated, residential] | [] | [] | [] | True | False | 2020–2024 | [30min] | [France] | AgenceORE/ |
| 1 | OPSD | Open Power System Data - EU-wide TSO-provided time series | 298 | [load, renewable, capacity, price] | [active, aggregated, national, historical] | [solar, wind] | [] | [price] | True | False | 2015-2020 | [15min, 30min, 60min] | [EU, United Kingdom, Switzerland, Norway, Montenegro, Serbia, Ukraine] | OPSD_TimeSeries/ |
| 2 | SimBench | Synthetic power system benchmark datasets for grid studies | 614 | [load, renewable, powerplant, storage] | [active, reactive, residential, industry, commercial] | [solar, wind, biomass, hydro] | [] | [] | True | True | 2016-2017 | [15min] | [Germany] | SimBench/ |
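
Because the catalog is written by hand, a quick consistency check can catch entries with missing keys or mismatched types before they reach the query step. The following is a minimal sketch and not part of the original notebook: `validate_catalog`, `LIST_COLUMNS`, and `BOOL_COLUMNS` are hypothetical names, and the two sample entries are stand-ins rather than real catalog rows.

```python
# Hypothetical helper: report structural problems in a hand-written catalog.
LIST_COLUMNS = ["Profile Types", "Load", "Renewable", "Environment",
                "Economy", "Time Resolution", "Geographical"]
BOOL_COLUMNS = ["Processed", "Synthetic"]


def validate_catalog(entries):
    """Return a list of human-readable problems (empty if none are found)."""
    problems = []
    for entry in entries:
        name = entry.get("Source", "<unnamed>")
        for column in LIST_COLUMNS:
            if not isinstance(entry.get(column), list):
                problems.append(f"{name}: '{column}' should be a list")
        for column in BOOL_COLUMNS:
            if not isinstance(entry.get(column), bool):
                problems.append(f"{name}: '{column}' should be a bool")
        if not isinstance(entry.get("Number of Profiles"), int):
            problems.append(f"{name}: 'Number of Profiles' should be an int")
        if not str(entry.get("Folder", "")).endswith("/"):
            problems.append(f"{name}: 'Folder' should end with '/'")
    return problems


# Illustrative entries: one well-formed, one with a string where an int is expected.
good = {
    "Source": "Good", "Profile Types": ["load"], "Load": ["active"],
    "Renewable": [], "Environment": [], "Economy": [],
    "Time Resolution": ["15min"], "Geographical": ["Germany"],
    "Processed": True, "Synthetic": True,
    "Number of Profiles": 614, "Folder": "Good/",
}
bad = dict(good, **{"Source": "Bad", "Number of Profiles": "130"})

problems = validate_catalog([good, bad])
print(problems)
```

In the notebook, the same call could be made on `data_sources` before building `df`.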


In [2]:
## Step 2

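# Query function for filtering data sources
def query_data_sources(df, 
                       load=None, 
                       renewable=None, 
                       environment=None, 
                       economy=None,
                       synthetic=None,
                       processed=None,
                       geographical=None):
    """Return the rows of `df` that satisfy every filter that is set.

    load, renewable, environment, economy: lists of required items; a source
        matches only if its column contains all of them.
    synthetic, processed: True/False to filter on the flag, None to ignore it.
    geographical: a country name or a list of names; a source tagged "EU"
        matches any EU member state.
    """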
    EU_COUNTRIES = [
        "Austria", "Belgium", "Bulgaria", "Croatia", "Cyprus", "Czech Republic", "Denmark", "Estonia",
        "Finland", "France", "Germany", "Greece", "Hungary", "Ireland", "Italy", "Latvia", "Lithuania",
        "Luxembourg", "Malta", "Netherlands", "Poland", "Portugal", "Romania", "Slovakia", "Slovenia",
        "Spain", "Sweden"
    ]
    
    result = df.copy()

    # List-valued filters: keep a source only if it contains ALL requested items.
    list_filters = {
        "Load": load,
        "Renewable": renewable,
        "Environment": environment,
        "Economy": economy,
    }
    for column, wanted in list_filters.items():
        if not wanted:
            continue
        if isinstance(wanted, str):
            wanted = [wanted]  # accept a single string as shorthand for a one-item list
        result = result[result[column].apply(lambda x, w=wanted: all(item in x for item in w))]

    if synthetic is not None:
        result = result[result["Synthetic"] == synthetic]

    if processed is not None:
        result = result[result["Processed"] == processed]

    if geographical:
        # normalize to list
        if isinstance(geographical, str):
            geographical = [geographical]

        def geo_match(source_geo):
            # Expand the "EU" tag to its member states; keep "EU" itself so it stays queryable
            expanded = []
            for g in source_geo:
                expanded.append(g)
                if g == "EU":
                    expanded.extend(EU_COUNTRIES)
            return all(g in expanded for g in geographical)

        result = result[result["Geographical"].apply(geo_match)]

    return result.reset_index(drop=True)
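
Each list filter in the function above keeps a row only when every requested item appears in that row's list; rows that match only some of the items are dropped. A small standalone sketch of that rule on a toy frame follows. The toy rows and the `contains_all` helper are illustrative and do not come from the real catalog.

```python
import pandas as pd

# Toy frame with a list-valued column, mimicking the catalog's "Renewable" column.
toy = pd.DataFrame({
    "Source": ["A", "B", "C"],
    "Renewable": [["solar", "wind"], ["solar"], []],
})


def contains_all(values, wanted):
    """True when every requested item is present in the row's list."""
    return all(item in values for item in wanted)


# Ask for sources that offer both solar AND wind.
mask = toy["Renewable"].apply(lambda values: contains_all(values, ["solar", "wind"]))
matched = toy[mask]["Source"].tolist()
print(matched)  # ['A']
```

Source `B` is excluded because it offers solar but not wind, which is the same behavior as `renewable=["solar", "wind"]` in the query function.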


#### How to Query the Dataset

Use the `query_data_sources()` function to filter datasets by any combination of:

- Type of load (e.g., active, reactive)
- Type of renewable source (e.g., solar, wind)
- Whether the dataset is synthetic or real
- Whether the dataset has been processed
- Presence of price/economic or environmental data
- Geographical coverage (e.g., France, Germany)
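
`query_data_sources()` as defined in Step 2 has no time-resolution argument, so filtering on resolution needs a separate step. Below is a minimal sketch, assuming the `Time Resolution` column holds lists of strings such as `"15min"`. `filter_by_resolution` is a hypothetical helper that is not part of the notebook, and the small `catalog` frame only mirrors the two relevant columns.

```python
import pandas as pd


def filter_by_resolution(df, resolution):
    """Keep rows whose 'Time Resolution' list contains the given resolution."""
    mask = df["Time Resolution"].apply(lambda values: resolution in values)
    return df[mask].reset_index(drop=True)


# Reduced copy of the catalog with only the columns this sketch needs.
catalog = pd.DataFrame({
    "Source": ["AgenceORE", "OPSD", "SimBench"],
    "Time Resolution": [["30min"], ["15min", "30min", "60min"], ["15min"]],
})

quarter_hourly = filter_by_resolution(catalog, "15min")
print(quarter_hourly["Source"].tolist())  # ['OPSD', 'SimBench']
```

The helper can be chained with the main query, for example by passing the result of `query_data_sources(...)` as `df`.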

#### Example Queries:

- Find datasets with both solar and wind data  
- Find datasets that are synthetic only  
- Find datasets that include price data and wind generation  
- Filter by specific load types such as residential or industry

Use the function in code cells like this:


In [3]:
## Step 3

# Here are some examples

# Find datasets with specific load types
# results = query_data_sources(df, load=["active", "reactive"])

# Find datasets with both solar and wind
results = query_data_sources(df, renewable=["solar", "wind"])

# Find only synthetic datasets
# results = query_data_sources(df, synthetic=True)

# Find datasets with price and solar generation
# results = query_data_sources(df, renewable=["solar"], economy=["price"])

# Combine multiple filters
# results = query_data_sources(df, load=["active"], renewable=["wind"], synthetic=False)

# Find all datasets that include solar renewable data in Poland
# results = query_data_sources(df, renewable=["solar"], geographical="Poland")

# Show result
results

| | Source | Description | Number of Profiles | Profile Types | Load | Renewable | Environment | Economy | Processed | Synthetic | Horizon | Time Resolution | Geographical | Folder |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | OPSD | Open Power System Data - EU-wide TSO-provided time series | 298 | [load, renewable, capacity, price] | [active, aggregated, national, historical] | [solar, wind] | [] | [price] | True | False | 2015-2020 | [15min, 30min, 60min] | [EU, United Kingdom, Switzerland, Norway, Montenegro, Serbia, Ukraine] | OPSD_TimeSeries/ |
| 1 | SimBench | Synthetic power system benchmark datasets for grid studies | 614 | [load, renewable, powerplant, storage] | [active, reactive, residential, industry, commercial] | [solar, wind, biomass, hydro] | [] | [] | True | True | 2016-2017 | [15min] | [Germany] | SimBench/ |


In [4]:
## Step 4

from IPython.display import Markdown

def show_folder_links(results):
    if results.empty:
        return Markdown("**No datasets match your query.**")
    
    links = [f"- [{row['Source']}]({row['Folder']})" for _, row in results.iterrows()]
    return Markdown("\n".join(links))

show_folder_links(results)

- [OPSD](OPSD_TimeSeries/)
- [SimBench](SimBench/)
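
The links above are relative paths, so they resolve only when the notebook sits next to the dataset subfolders. A hedged sketch of a pre-check that lists folders missing on disk is shown below. `missing_folders` is a hypothetical helper, and the base directory is an assumption about where the notebook is run.

```python
from pathlib import Path
import tempfile


def missing_folders(folders, base_dir="."):
    """Return the folder names that do not exist under base_dir."""
    base = Path(base_dir)
    return [folder for folder in folders if not (base / folder).is_dir()]


# Demonstration in a temporary directory where only one of two folders exists.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "SimBench").mkdir()
    absent = missing_folders(["OPSD_TimeSeries/", "SimBench/"], base_dir=tmp)

print(absent)  # ['OPSD_TimeSeries/']
```

In the notebook, `missing_folders(results["Folder"])` could be called before `show_folder_links(results)` to flag links that would not open.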