## System 1: Cleaning Gate Power Consumption Data from Schneider Electric

**Inputs:**

* **1a:** (CSVs) Schneider Electric Gate Data after "Super Macro Ultra" has been run. Multiple CSVs are valid, but data should not overlap. Sample: ![](screenshots/1a_sample_meter_consumption_data_input.png)


* **1b:** (CSV) Table that maps column names in 1a to the correct gate names (i.e., the names used in your operations data). Sample: ![](screenshots/1b_sample_gate_mapping_table.png)


**Output:**

* **1c:** (CSV) Cleaned CSV/Pandas dataframe with power for each time period at each gate. Example: ![](screenshots/1c_sample_cleaned_consumption_data_output.png)
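Input 1a asks that data not overlap across CSVs. One plausible reading, given that the lookup later in this notebook returns the first CSV containing a given column, is that no meter column should appear in more than one file. The sketch below checks for that under this assumption; `find_duplicate_meter_columns` is a hypothetical helper (not part of the notebook), and the file contents are made up for illustration.

```python
from collections import defaultdict

def find_duplicate_meter_columns(columns_by_file, shared_columns=("Timestamp",)):
    """Return {column: [files]} for meter columns found in more than one file."""
    files_by_column = defaultdict(list)
    for file_name, columns in columns_by_file.items():
        for column in columns:
            # Skip columns expected in every file and pandas' placeholder
            # names for unnamed columns (e.g. "Unnamed: 21")
            if column in shared_columns or column.startswith("Unnamed:"):
                continue
            files_by_column[column].append(file_name)
    return {col: files for col, files in files_by_column.items() if len(files) > 1}

# Made-up column listings for two input files
columns_by_file = {
    "20.csv": ["Timestamp", "BA_A.032-001--Real Energy Total--(kWh)", "Unnamed: 21"],
    "21.csv": ["Timestamp", "BA_A.032-001--Real Energy Total--(kWh)",
               "BA_A.032-002--Real Energy Total--(kWh)"],
}
duplicates = find_duplicate_meter_columns(columns_by_file)
print(duplicates)
```

Once the CSVs have been loaded into DataFrames, the same check could be fed with each DataFrame's `list(df.columns)`.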


In [1]:
# Dependencies

# For any missing libraries, just run (remove curly braces):
# !pip install {library_name}
 
## Required
import numpy as np
import pandas as pd
import datetime # support datetime

## Recommended (mainly for visualizations)
from tqdm.notebook import tqdm # Provides progress bar for long operations
import matplotlib # Data visualization tool
import seaborn as sns # Data visualization tool on top of matplotlib
matplotlib.use('nbagg') # Enables interactive figures
import matplotlib.pyplot as plt # Plotting interface for matplotlib
import plotly.express as px # Plotly plotting tool
# Enables inline visualizations:
%matplotlib inline

In [2]:
# Inputs, settings, and toggles

input_1a_location = 'sample_data/1a_consumption_input_data'
input_1a_consumption_data_csvs = ['20.csv', '21.csv', '22.csv', '23.csv', '24.csv', '25.csv', '26.csv']

input_1b_gate_to_meter_metric_mapping = "sample_data/1b_mapping.csv"

output_1c_desired_filename = "sample_data/1c_cleaned_consumption_data.csv"

# Helper function that controls whether a certain time period should be excluded from the output data
# Returns True if the time period should be included, and False otherwise
#
# For example, the Daylight Saving Time change is excluded in the example below to avoid confusion (it is already
# a very low activity time period). To disable this filter, make the function simply "return True"
def should_include_time(curr_timestamp):
    start = datetime.datetime(day=3, month=11, year=2019, hour=1, minute=0, second=0)
    end = datetime.datetime(day=3, month=11, year=2019, hour=3, minute=0, second=0)
    if start < curr_timestamp < end:
        return False
    return True

# The following variables define the range of valid power consumption for a given gate, in kW
# We implemented this check because we noticed that when the consumption data changed in granularity, there were
# spurious spikes in power (in the 1000s). Rows with power outside these bounds will be discarded, along with
# their surrounding rows (how many is set by the last variable)
power_lower_bound_kW = -50 # Default is -50
power_upper_bound_kW = 250 # Default is 250
surrounding_rows_to_discard = 20 # Default is 20

### Code begins below:

In [3]:
# Create a DataFrame for each of the consumption CSVs (input 1a)
dfs = {}
for csv in input_1a_consumption_data_csvs:
    dfs[csv] = pd.read_csv("%s/%s" % (input_1a_location, csv), thousands=',')
    
# This is the form our data starts as (as it comes from the metering/consumption system):
dfs[input_1a_consumption_data_csvs[0]].head(3)

Unnamed: 0,Timestamp,BA_A.032-001--Real Energy Into the Load--(kWh),BA_A.032-001--Real Energy Total--(kWh),BA_A.032-002--Real Energy Into the Load--(kWh),BA_A.032-002--Real Energy Total--(kWh),BA_A.032-003--Real Energy Into the Load--(kWh),BA_A.032-003--Real Energy Total--(kWh),BA_A.032-004--Real Energy Into the Load--(kWh),BA_A.032-004--Real Energy Total--(kWh),BA_A.032-005--Real Energy Into the Load--(kWh),...,BA_A.032-006--Real Energy Total--(kWh),BA_A.032-007--Real Energy Into the Load--(kWh),BA_A.032-007--Real Energy Total--(kWh),BA_A.032-008--Real Energy Into the Load--(kWh),BA_A.032-008--Real Energy Total--(kWh),BA_A.032-009--Real Energy Into the Load--(kWh),BA_A.032-009--Real Energy Total--(kWh),BA_A.032-010--Real Energy Into the Load--(kWh),BA_A.032-010--Real Energy Total--(kWh),Unnamed: 21
0,9/1/2019 12:05:00 AM,,135487.07,,235010.9,,185317.22,,198285.44,,...,427419.6,,457826.04,,358718.8,,280811.42,,345921.94,
1,9/1/2019 12:10:00 AM,,135487.07,,235010.9,,185317.26,,198285.45,,...,427419.67,,457831.97,,358718.87,,280811.49,,345921.94,
2,9/1/2019 12:15:00 AM,,135487.07,,235010.9,,185317.82,,198285.46,,...,427419.74,,457837.79,,358718.95,,280811.56,,345921.95,


As the data above shows, the raw export is messy. Each row is a 5-minute reading of a gate's cumulative energy consumption (kWh) up to that timestamp, not the power drawn during the interval. Many gates are spread across many columns, and the column names don't match the gate names in our operations data. Further inspection also shows that the granularity changes at some points, from 5 minutes to 1 hour, so the processing below has to handle variable intervals.
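A quick way to see where the granularity changes is to look at the gap between consecutive timestamps. The sketch below does this on made-up timestamps written in the raw export's format; the variable names and the assumed 5-minute baseline are illustrative, not part of the notebook.

```python
from collections import Counter
from datetime import datetime, timedelta

# Made-up timestamps in the raw export's format: 5-minute rows, then hourly rows
raw_timestamps = [
    "9/1/2019 12:05:00 AM",
    "9/1/2019 12:10:00 AM",
    "9/1/2019 12:15:00 AM",
    "9/1/2019 1:15:00 AM",
    "9/1/2019 2:15:00 AM",
]
parsed = [datetime.strptime(t, "%m/%d/%Y %I:%M:%S %p") for t in raw_timestamps]

# Interval between each row and the row before it
intervals = [later - earlier for earlier, later in zip(parsed, parsed[1:])]
interval_counts = Counter(intervals)

# Rows whose interval differs from the assumed 5-minute baseline
expected_interval = timedelta(minutes=5)
irregular = [(t, d) for t, d in zip(parsed[1:], intervals) if d != expected_interval]

for interval, count in sorted(interval_counts.items()):
    print(interval, count)
```

On a pandas Series of parsed timestamps, `Series.diff()` yields the same per-row intervals.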

In [4]:
# We need to find the power consumption for each gate at SFO that we care about (nearly all gates at SFO), 
# and rename that gate to match the (cleaner) gate names in the operations data above. We'll use the input
# mapping data, so let's bring that in
gates_df = pd.read_csv(input_1b_gate_to_meter_metric_mapping)

# Find the DataFrame that contains the column for a given meter
# The mapping (input 1b) is read below via its "gate" column (e.g. "Gate_A10") and its "meter" column
# (e.g. "BA_A.032-001--Real Energy Total--(kWh)")
def get_df_for_gate(gate):
    for df in dfs:
        curr_df = dfs[df]
        if gate in curr_df.columns:
            return curr_df[['Timestamp', gate]].copy()
    print("Failed to find: %s" % gate)
    return None

# Generate a DF for each gate
gates = {}        
for index, row in tqdm(gates_df.iterrows(), total=len(gates_df)):
    gate_name = row['gate']
    meter_name = row['meter'].replace('\\n', '\n')
    curr_df = get_df_for_gate(meter_name)
    if curr_df is None:
        # Skip gates whose meter column is missing from every input CSV
        continue
    curr_df['Real_Timestamp'] = pd.to_datetime(curr_df['Timestamp'], infer_datetime_format=True)
    curr_df['Gate'] = gate_name
    curr_df.rename(columns={meter_name:'Cumulative kWh'}, inplace=True)
    gates[gate_name] = curr_df


In [5]:
# Great, now we have our gates labeled properly and in one place. Let's take a look at one:
gates[gates_df['gate'][0]].head()

Unnamed: 0,Timestamp,Cumulative kWh,Real_Timestamp,Gate
0,9/1/19 0:05,924224.58,2019-09-01 00:05:00,Gate_D50B
1,9/1/19 0:10,924225.66,2019-09-01 00:10:00,Gate_D50B
2,9/1/19 0:15,924226.75,2019-09-01 00:15:00,Gate_D50B
3,9/1/19 0:20,924227.84,2019-09-01 00:20:00,Gate_D50B
4,9/1/19 0:25,924228.91,2019-09-01 00:25:00,Gate_D50B


Things look much cleaner! We now have a properly formatted timestamp and have reasonable gate names.

However, for each row, we need to identify:

- the granularity of that row (5 minutes' worth of data? 1 hour's?)
- the average power consumed over the time period of that row

We also need to skip the 2-hour window around the Daylight Saving Time change, using the `should_include_time` function defined in the inputs section.
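The average power over a row is the change in cumulative energy divided by the elapsed time in hours. The sketch below works this through for the first two Gate_D50B readings shown above; `average_power_kw` is a hypothetical helper name used only for illustration.

```python
from datetime import datetime

def average_power_kw(prev_kwh, prev_time, curr_kwh, curr_time):
    """Average power (kW) between two cumulative energy readings (kWh)."""
    elapsed_hours = (curr_time - prev_time).total_seconds() / 3600
    if elapsed_hours <= 0:
        raise ValueError("Readings must be in strictly increasing time order")
    return (curr_kwh - prev_kwh) / elapsed_hours

# 1.08 kWh consumed over 5 minutes (1/12 of an hour)
power = average_power_kw(
    924224.58, datetime(2019, 9, 1, 0, 5),
    924225.66, datetime(2019, 9, 1, 0, 10),
)
print(round(power, 2))  # 12.96
```

`total_seconds()` is used because it counts whole days as well; `timedelta.seconds` holds only the sub-day remainder, so it would understate any gap of a day or longer.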

In [6]:
# Format each gate per our desired specification:

def format_gate(gate_df):
    new_data = []
    last_valid = {'value': -1, 'time': ''}

    for index, row in gate_df.iterrows():
        curr_kwh = row['Cumulative kWh']
        curr_timestamp = row['Real_Timestamp']
        if curr_kwh > 0 and should_include_time(curr_timestamp):
            if last_valid['value'] == -1: # first instance
                last_valid['value'] = curr_kwh
                last_valid['time'] = curr_timestamp
            else:
                delta = curr_kwh - last_valid['value']
                time_diff = curr_timestamp - last_valid['time']
                new_data.append({
                    'Real_Timestamp': curr_timestamp,
                    'Power_kW': 3600 * (delta / time_diff.total_seconds()),
                    'Cumulative_kWh': curr_kwh,
                    'Row_Time_Delta': time_diff,
                    'Gate': row['Gate'],
                })
                last_valid['value'] = curr_kwh
                last_valid['time'] = curr_timestamp
        
    new_dataframe = pd.DataFrame(new_data)

    # Find rows whose power falls outside the valid bounds
    power_query = 'Power_kW > ' + str(power_upper_bound_kW) + ' or Power_kW < ' + str(power_lower_bound_kW)
    indices_that_are_invalid = new_dataframe.query(power_query).index.values
    # Collect each invalid row together with its surrounding rows
    indices_to_remove = {
        i2
        for i in indices_that_are_invalid
        for i2 in range(i - surrounding_rows_to_discard, i + surrounding_rows_to_discard)
    }
    # errors='ignore' skips labels that fall outside the DataFrame (invalid rows near the start or end)
    new_dataframe_cleaned = new_dataframe.drop(index=list(indices_to_remove), errors='ignore')
    return new_dataframe_cleaned

# Format all gates
fixed_gates = {}
for gate in tqdm(gates):
    fixed_gates[gate] = format_gate(gates[gate])

# Combine all gates
all_gates_fixed = pd.concat(list(fixed_gates.values()))
all_gates_fixed.head()


Unnamed: 0,Real_Timestamp,Power_kW,Cumulative_kWh,Row_Time_Delta,Gate
0,2019-09-01 00:10:00,12.96,924225.66,0 days 00:05:00,Gate_D50B
1,2019-09-01 00:15:00,13.08,924226.75,0 days 00:05:00,Gate_D50B
2,2019-09-01 00:20:00,13.08,924227.84,0 days 00:05:00,Gate_D50B
3,2019-09-01 00:25:00,12.84,924228.91,0 days 00:05:00,Gate_D50B
4,2019-09-01 00:30:00,12.96,924229.99,0 days 00:05:00,Gate_D50B
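To make the out-of-bounds filter concrete, the sketch below computes which row positions are discarded around a single spike. It assumes the same half-open window as `range(i - n, i + n)`, clipped to valid positions; `positions_to_discard` is a hypothetical helper and the power values are made up.

```python
def positions_to_discard(power_values, lower, upper, window):
    """Return sorted row positions to drop: each out-of-bounds row plus its neighbours."""
    to_discard = set()
    for i, power in enumerate(power_values):
        if power < lower or power > upper:
            # Same half-open window as range(i - window, i + window),
            # clipped so it never reaches outside the data
            start = max(i - window, 0)
            stop = min(i + window, len(power_values))
            to_discard.update(range(start, stop))
    return sorted(to_discard)

# Made-up power series (kW) with one spike at position 3
powers = [12.9, 13.1, 13.0, 4200.0, 13.2, 12.8, 13.0]
discarded = positions_to_discard(powers, lower=-50, upper=250, window=2)
print(discarded)  # [1, 2, 3, 4]
```

Because `range` excludes its upper bound, this window covers `window` rows before the spike but only `window - 1` rows after it; using `i + window + 1` as the upper bound would make it symmetric.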


In [8]:
all_gates_fixed['Gate'].unique()

array(['Gate_D50B', 'Gate_D51A', 'Gate_D51B', 'Gate_D52', 'Gate_D53',
       'Gate_D54A', 'Gate_D54B', 'Gate_D55', 'Gate_D50A', 'Gate_D56A',
       'Gate_D56B', 'Gate_D57', 'Gate_D58A', 'Gate_D58B', 'Gate_A1',
       'Gate_A1B', 'Gate_A2', 'Gate_A3', 'Gate_A4', 'Gate_A5',
       'Gate_A11A', 'Gate_A6', 'Gate_A7', 'Gate_A8', 'Gate_A9',
       'Gate_A10', 'Gate_A12', 'Gate_G100', 'Gate_G91', 'Gate_G92',
       'Gate_G93', 'Gate_G94', 'Gate_G95', 'Gate_G96', 'Gate_G97',
       'Gate_G98', 'Gate_E60', 'Gate_E61', 'Gate_E62', 'Gate_E63',
       'Gate_E64', 'Gate_E65', 'Gate_E66', 'Gate_E67', 'Gate_E68',
       'Gate_E69', 'Gate_F70', 'Gate_F71A', 'Gate_F71B', 'Gate_F77B',
       'Gate_F79'], dtype=object)

In [7]:
# Now let's export this data to a CSV with our defined filename
all_gates_fixed.to_csv(output_1c_desired_filename, index=False, header=True)
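When the exported CSV is read back, the timestamp and time-delta columns arrive as plain text unless they are parsed again. The sketch below shows one way to restore them, using an in-memory buffer and two made-up rows in the output's column layout in place of the real file.

```python
import io

import pandas as pd

# Two made-up rows in the same column layout as output 1c
df = pd.DataFrame({
    "Real_Timestamp": pd.to_datetime(["2019-09-01 00:10:00", "2019-09-01 00:15:00"]),
    "Power_kW": [12.96, 13.08],
    "Cumulative_kWh": [924225.66, 924226.75],
    "Row_Time_Delta": pd.to_timedelta([300, 300], unit="s"),
    "Gate": ["Gate_D50B", "Gate_D50B"],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

# Parse the timestamp column on read, then convert the time-delta strings back
restored = pd.read_csv(buffer, parse_dates=["Real_Timestamp"])
restored["Row_Time_Delta"] = pd.to_timedelta(restored["Row_Time_Delta"])
print(restored.dtypes)
```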