In [None]:
# -----********************-----

# Created Time: 2025/07/03

# Author: Yiyi He, Tiger Peng

### Use Case

# This notebook creates uniform voltage data from the scraped minute-wise voltage data.
# The two main data sources in this project are the currently operating ESMI station data,
# which is scraped from the Prayas ESMI website using the india_esmi_scraper.py scraper,
# and the Harvard Dataverse data.


# The Harvard Dataverse data is formatted as a grid with 60 columns and n rows: each row is labelled with an hour of the day (0-23),
# and each column is a minute of the hour (0-59).

# The scraped data is formatted in two columns: the time column holds the datetime down to the minute,
# and the voltage column holds the voltage, from 0-255.
# -----********************-----
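To make the difference between the two layouts concrete, here is a minimal sketch of flattening one day's hour-by-minute grid into the two-column layout. It assumes a complete 24 x 60 grid with rows and columns already in ascending order. The function name `grid_to_long`, the column names `time` and `voltage`, and the demo date are illustrative assumptions, not names taken from the actual files.

```python
import numpy as np
import pandas as pd

def grid_to_long(grid, day):
    """Flatten a 24 x 60 hour-by-minute grid into (time, voltage) rows.

    Assumes rows are hours 0-23 and columns are minutes 0-59, both sorted.
    """
    # Row-major flattening yields hour 0 (minutes 0-59), then hour 1, and so on.
    values = grid.to_numpy().ravel()
    # One timestamp per minute, starting at midnight of the given day.
    times = pd.date_range(start=day, periods=values.size, freq="min")
    return pd.DataFrame({"time": times, "voltage": values})

# Hypothetical demo data: a constant reading with a single zero at 13:05.
demo_grid = pd.DataFrame(np.full((24, 60), 230), index=range(24), columns=range(60))
demo_grid.loc[13, 5] = 0
long_demo = grid_to_long(demo_grid, "2015-01-01")
```

Flattening by position keeps missing readings in place as NaN rather than dropping them, so every day still yields 1440 rows.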

# Libraries

In [1]:
import os
import pandas as pd
from tqdm import tqdm
import shutil
from datetime import datetime, timedelta
import numpy as np
import math

# ESMI and Harvard Dataverse ID

In addition to having different data formats, the two data sources also use different identification systems.
Harvard Dataverse identifies stations by name, covering the 528 stations active during the dataset's time period, from 2014 to 2018.
The ESMI scraper uses the ESMI IDs assigned in the dropdown elements that the web scraper uses to select stations on the Prayas website.

The India station locations were all cross-referenced (still need to document this), and duplicates and stations outside of the ERA5-Land dataset were removed. Each of the newly scraped ESMI stations was given an ID number following 528 (the last number assigned to the Dataverse stations), and duplicates in the Dataverse data were removed.

After this process, we have IDs from 1 to 572 (with some missing because duplicates were removed), and a uniform data format.

These files will all be stored in the india_processing/india_uniform directory, and their IDs can be referenced in the file ESMI_India_538_locations.csv.
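As a rough illustration of the numbering scheme described above, the sketch below gives new ESMI stations consecutive IDs starting after 528. The order in which IDs were actually assigned is not documented here, so the helper `assign_new_ids` and the sample ESMI IDs are purely hypothetical.

```python
LAST_DATAVERSE_ID = 528  # last ID assigned to the Dataverse stations

def assign_new_ids(new_esmi_ids, start=LAST_DATAVERSE_ID + 1):
    """Map each new ESMI ID to the next free station_id, in the order given."""
    return {
        esmi_id: station_id
        for station_id, esmi_id in enumerate(new_esmi_ids, start=start)
    }

# Hypothetical ESMI IDs, kept as strings like the ESMI_ID column.
demo_ids = assign_new_ids(["101", "102", "103"])
```

In the real table some IDs are absent because duplicates were removed after numbering, which this sketch does not model.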

## Converting Dataverse Voltage Data

We first need to verify that every station deemed to be within the ERA5 dataset's range is present in either the Dataverse dataset or the ESMI dataset.
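One way to frame this check is as a set difference: every expected station must appear in the union of the two sources. The sketch below assumes stations are compared by cleaned name. The function `find_uncovered` and the station names are hypothetical placeholders, not part of the project code.

```python
def find_uncovered(expected_names, dataverse_names, esmi_names):
    """Return the expected stations found in neither data source, sorted."""
    covered = set(dataverse_names) | set(esmi_names)
    return sorted(set(expected_names) - covered)

# Hypothetical station names for illustration only.
expected = ["Station A", "Station B", "Station C"]
dataverse = ["Station A"]
esmi = ["Station C"]
missing = find_uncovered(expected, dataverse, esmi)
```

An empty result means every expected station is covered by at least one source.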

In [2]:
def strip_name(name, exceptions=None):
    """Return the station name without any '-' or '[' suffix, trimmed of whitespace."""
    formatted = name.split('-')[0].split('[')[0].strip()

    # Run the name through the caller-supplied exceptions function, if provided.
    if exceptions:
        return exceptions(formatted)
    return formatted

In [None]:
# Import the location information table, which we can use to map between station_id, ESMI_ID, and station names.
locations_path = "../ESMI_India_538_locations.csv"
locations = pd.read_csv(locations_path, dtype={"ESMI_ID" : str, "station_id" : int}, usecols=["station_id", "ESMI_ID", "Location name", "District", "State", "Lat", "Lon"])
locations.rename(columns={"Location name": "station_name", "District" : "district", "State": "state", "Lat": "lat", "Lon": "lon"}, inplace=True)
locations["station_name"] = locations["station_name"].apply(strip_name)