## Plant Level Clustering


### Initial Setup

In [1]:
import os, sys
import pprint as p
import pyspark.sql.functions as pysF
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

py_file_path = os.path.join(
    os.getcwd(),
    "..",
    ".."
)

sys.path.append(py_file_path)
from app.SparkTools import MyPySpark

#ensure only one sc and spark instance is running
#reuse the instance left by an earlier run of this cell, if there is one
MySpark = globals().get("MySpark") or MyPySpark(
    master = 'local[3]', 
    logger_name = 'jupyter')

In [19]:
electricity_dim_df = MySpark\
    .spark\
    .read\
    .parquet("/Processed/ElectricityPlantLevelDimDF")\
    .filter(
        (pysF.col("series_id").rlike(r"^ELEC\.PLANT\.GEN\.")) &
        (pysF.lower(pysF.col("engine_type")) == 'all primemovers') &
        (pysF.col("f") == "M"))\
    .withColumn(
        "fuel_type",
        pysF.regexp_replace(
            pysF.regexp_replace(
                pysF.col("fuel_type"),
                r"[^a-zA-Z0-9\s]", 
                ""),
            r"\s",
            "_"))\
    .filter(pysF.col("iso3166").isNotNull())\
    .withColumn(
        "state",
        pysF.regexp_extract(
            pysF.col("iso3166"),
            r"^USA-([A-Z]+)",
            1))\
    .drop(
        "f", 
        "copyright", 
        "description", 
        "name", 
        "source", 
        "value_type", 
        "engine_type", 
        "frequency", 
        "iso3166",
        "geography",
        "last_updated")

electricity_fact_df = MySpark\
    .spark\
    .read\
    .parquet("/Processed/ElectricityFactDF")

electricity_df = electricity_fact_df.join(
    pysF.broadcast(electricity_dim_df),
    on = "series_id",
    how = "right"
)
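
The nested `regexp_replace` calls and the `regexp_extract` in the cell above can be tried on plain strings without a Spark session. The sketch below mirrors the same patterns with Python's `re` module; the helper names and the sample label are hypothetical, and Python's `\s` is only an approximation of the Java regex engine Spark uses.

```python
import re

def normalize_fuel_type(raw):
    """Drop everything except letters, digits and whitespace,
    then replace each whitespace character with an underscore."""
    return re.sub(r"\s", "_", re.sub(r"[^a-zA-Z0-9\s]", "", raw))

def extract_state(iso3166):
    """Return the letters after 'USA-', or '' when the pattern does not match
    (regexp_extract also yields an empty string on no match)."""
    match = re.match(r"^USA-([A-Z]+)", iso3166)
    return match.group(1) if match else ""

# Hypothetical fuel label: removing '&' leaves two adjacent spaces,
# and each one becomes its own underscore.
print(normalize_fuel_type("natural gas & other gases"))  # natural_gas__other_gases
print(extract_state("USA-AL"))                           # AL
print(repr(extract_state("CAN-ON")))                     # ''
```

Because non-matching `iso3166` values produce an empty `state` rather than a null, a later filter on `state` would need to check for `""` as well as nulls.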

In [20]:
electricity_df.columns

['series_id',
 'date',
 'value',
 'end',
 'lat',
 'lon',
 'start',
 'units',
 'plant_name',
 'fuel_type',
 'plant_id',
 'state']

### Data Transformation Ideas
Convert the grain to one row per plant, which means generation has to be summarized in some way.
- Note: the grain of the current data is one row per plant per fuel type per month (if all_primemovers is removed).
- Maybe compute all metrics first, then pivot on fuel type to get one column per fuel_type.
- Last 5 years, by fuel type.
- 5-year average from 5 years ago, by fuel type.
- The above metrics, expressed as a % of the current value.
- Lag.
- Lag as a % of the current value.
- % by fuel type? Are there mixed-fuel plants?
- Is grouping by fuel type necessary? Again, are there mixed-fuel plants?
- Start date: is that when the plant became operational? If so, maybe also cut off at the end date.
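
A few of the ideas above can be sketched in pandas on toy data before committing to a Spark implementation. This is only one possible reading of the list: it assumes the "last 5 years" metric is a mean of monthly values, uses made-up numbers for a single hypothetical plant, and the function names `avg_by_fuel` and `plant_features` are not part of the project.

```python
import pandas as pd

# Toy monthly data (made-up values): one plant, two fuel types, 120 months.
dates = pd.date_range("2008-08-01", "2018-07-01", freq="MS")
toy = pd.concat([
    pd.DataFrame({"plant_id": 10, "fuel_type": "bituminous_coal",
                  "date": dates, "value": [100.0] * 60 + [50.0] * 60}),
    pd.DataFrame({"plant_id": 10, "fuel_type": "natural_gas",
                  "date": dates, "value": [0.0] * 60 + [200.0] * 60}),
], ignore_index=True)

def avg_by_fuel(df, start, end):
    """Mean monthly value per plant and fuel type for start < date <= end,
    pivoted to one row per plant and one column per fuel type."""
    window = df[(df["date"] > start) & (df["date"] <= end)]
    return window.pivot_table(index="plant_id", columns="fuel_type",
                              values="value", aggfunc="mean")

def plant_features(df, as_of, years=5):
    """One row per plant: recent average, prior average and recent share
    for each fuel type."""
    as_of = pd.Timestamp(as_of)
    mid = as_of - pd.DateOffset(years=years)
    start = as_of - pd.DateOffset(years=2 * years)
    recent = avg_by_fuel(df, mid, as_of)
    prior = avg_by_fuel(df, start, mid)
    # Share of recent generation by fuel type; NaN for a plant whose
    # recent total is zero.
    share = recent.div(recent.sum(axis=1), axis=0)
    return (recent.add_suffix("_recent_avg")
            .join(prior.add_suffix("_prior_avg"))
            .join(share.add_suffix("_share")))

features = plant_features(toy, "2018-07-01")
print(features.T)
```

On the toy data, the coal average drops from 100 to 50 between the two windows and natural gas ends up with a 0.8 share of recent generation, which is the kind of mixed-fuel signal the questions in the list are asking about.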

In [21]:
electricity_df.limit(10).toPandas().head(5)

| | series_id | date | value | end | lat | lon | start | units | plant_name | fuel_type | plant_id | state |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ELEC.PLANT.GEN.10-BIT-ALL.M | 2018-07-01 | 0 | 201807 | 32.6017 | 32.6017 | 200101 | megawatthours | Greene County (10) | bituminous_coal | 10 | AL |
| 1 | ELEC.PLANT.GEN.10-BIT-ALL.M | 2018-06-01 | 0 | 201807 | 32.6017 | 32.6017 | 200101 | megawatthours | Greene County (10) | bituminous_coal | 10 | AL |
| 2 | ELEC.PLANT.GEN.10-BIT-ALL.M | 2018-05-01 | 0 | 201807 | 32.6017 | 32.6017 | 200101 | megawatthours | Greene County (10) | bituminous_coal | 10 | AL |
| 3 | ELEC.PLANT.GEN.10-BIT-ALL.M | 2018-04-01 | 0 | 201807 | 32.6017 | 32.6017 | 200101 | megawatthours | Greene County (10) | bituminous_coal | 10 | AL |
| 4 | ELEC.PLANT.GEN.10-BIT-ALL.M | 2018-03-01 | 0 | 201807 | 32.6017 | 32.6017 | 200101 | megawatthours | Greene County (10) | bituminous_coal | 10 | AL |
