# Gorilla Data Engineer Assessment

Use **pandas** to calculate a transportation distribution charge for four gas meters in
the United Kingdom. While solving this exercise, focus on efficiency - i.e., use vectorised operations and avoid loops!
> Transportation distribution charges are levied by gas distribution companies for the use of their
lower pressure pipelines; they cover the cost of physically transporting the gas through the
pipeline. This rate is determined depending on a meter's exit zone (gas network region) and its
estimated annual quantity (AQ); and it changes over time.

* The daily charge is calculated by finding the correct rate for each meter and day in the
forecast and multiplying this rate (in p/kWh) with the day's forecast (in kWh).
* Calculate the total cost per meter by summing its daily charges for the full forecast
period and converting to pounds (1p = £0.01).
* Calculate the total consumption per meter by summing its daily consumption
forecast for the full period.

In [2]:
# import sys
# print(sys.executable) 

import pandas as pd

In [3]:
# Daily Charge Calculation
"""
I: Rate, Forecast
    Rate (R) = p/kWh (pence per kWh) / pandas Series
    Forecast (F) = consumption in kWh / pandas Series
O: Cost of consumption per day (in pence) / pandas Series
C: Rate and forecast must be non-negative
E:
    Zero values give a zero daily charge
    Very large values may overflow, and floating-point arithmetic may lose precision
    Non-numeric inputs should raise an error
    Negative values are invalid and should raise an error
"""
def daily_charge(r, f):
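    # Check edge cases
    # Inputs must be pandas Series holding numeric values
    if not isinstance(r, pd.Series) or not isinstance(f, pd.Series):
        raise TypeError("Rate and forecast must be pandas Series.")
    if not (pd.api.types.is_numeric_dtype(r) and pd.api.types.is_numeric_dtype(f)):
        raise TypeError("Rate and forecast must contain numeric values.")
    # Values must be non-negative
    if (r < 0).any():
        raise ValueError("Rate must be non-negative.")
    if (f < 0).any():
        raise ValueError("Forecast must be non-negative.")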

    # Calculate
    return r * f

# Tests
try:
    r = pd.Series([15.5, 16.0, 14.8, 15.2, 16.5])
    f = pd.Series([20.0, 25.0, 30.0, 15.0, 10.0])
    print(daily_charge(r,f))
except (ValueError, TypeError) as error:
    print(error)

0    310.0
1    400.0
2    444.0
3    228.0
4    165.0
dtype: float64


In [4]:
# Total Cost per Meter Calculation
"""
I: rate, forecast
O: total cost in pounds / numeric - float or int
"""

def meter_cost(r,f):
    daily_charges = daily_charge(r,f) # calculate the daily charges
    total = daily_charges.sum() # get the sum total of the daily charges - in pence
    return total * 0.01 # convert to pounds

# Tests
try:
    r = pd.Series([15.5, 16.0, 14.8, 15.2, 16.5])
    f = pd.Series([20.0, 25.0, 30.0, 15.0, 10.0])
    print(meter_cost(r,f))
except (TypeError,ValueError) as e:
    print(e)

15.47


In [5]:
# Total Consumption per Meter Calculation
"""
I: Forecast / pandas Series
O: Consumption in kWh / numeric
C: Must be non-negative.
E: Needs its own input checks, since it does not go through daily_charge().
"""

def meter_consumption(f):
    # Input must be a pandas Series
    if not isinstance(f, pd.Series):
        raise TypeError("Forecast must be a pandas Series of numeric values.")
    # Non-negative values
    if (f < 0).any(): 
        raise ValueError("Forecast must be non-negative.")
    
    # forecast - total consumption 
    return f.sum()

# Tests
try:
    f = pd.Series([20.0,25.0,30.0,15.0,10.0])
    print(meter_consumption(f))
except (TypeError,ValueError) as e:
    print(e)

100.0


In [19]:
# retrieve csv files
df_rate = pd.read_csv("./data/rate.csv")
df_forecast = pd.read_csv("./data/forecast.csv")
df_meter = pd.read_csv("./data/meter.csv")

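# TODO: edge cases and constraints are not handled yet. In particular, the merges
# below match rates on exit zone and exact date only: the meter's AQ is not used
# to select the rate, and the inner join drops any forecast day that has no rate
# row with the same date.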

# combine all csv files
df_combined = pd.merge(df_meter, df_rate, on='exit_zone')
df_data = pd.merge(df_combined, df_forecast, on=['meter_id','date'], how='inner')

# calculate the daily charge
# note: this column arithmetic repeats what daily_charge() does, without its input checks
df_data['daily_charge'] = df_data['rate_p_per_kwh'] * df_data['kwh']

# calculate the total cost, forecast
cost_total = df_data.groupby('meter_id')['daily_charge'].sum() * 0.01
forecast_total = df_data.groupby('meter_id')['kwh'].sum()

# create one table; reset the index because groupby made meter_id the index
df_total = pd.merge(forecast_total, cost_total, on='meter_id').reset_index()

# making it pretty now
columns = {
    'meter_id': 'Meter ID', 
    'kwh': 'Total Estimated Consumption (kWh)', 
    'daily_charge': 'Total Cost (£)'
    }
df_result = df_total.rename(columns=columns).to_string(index=False)

# see the magic
print(df_result)
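
The task statement says the rate depends on the exit zone, the AQ and the date. The following is a minimal sketch, on toy data, of one way to select the rate along all three. It assumes that each rate row carries the date from which it applies and an AQ band; the column names `aq_kwh`, `aq_min_kwh` and `aq_max_kwh` and all values are hypothetical and may not match the CSV files.

```python
import pandas as pd

# Toy tables. aq_kwh, aq_min_kwh and aq_max_kwh are assumed column names.
meters = pd.DataFrame({
    "meter_id": [1, 2],
    "exit_zone": ["EA1", "EA1"],
    "aq_kwh": [50_000, 100_000],
})
rates = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-01-01", "2024-02-01", "2024-02-01"]),
    "exit_zone": ["EA1"] * 4,
    "aq_min_kwh": [0, 73_200, 0, 73_200],
    "aq_max_kwh": [73_200, float("inf"), 73_200, float("inf")],
    "rate_p_per_kwh": [0.25, 0.20, 0.5, 0.4],
})
forecast = pd.DataFrame({
    "meter_id": [1, 1, 2, 2],
    "date": pd.to_datetime(["2024-01-31", "2024-02-01"] * 2),
    "kwh": [100.0, 100.0, 200.0, 200.0],
})

# 1. Join on exit zone and keep only the rate rows whose AQ band contains the meter's AQ.
banded = meters.merge(rates, on="exit_zone")
in_band = (banded["aq_kwh"] >= banded["aq_min_kwh"]) & (banded["aq_kwh"] < banded["aq_max_kwh"])
banded = banded.loc[in_band, ["meter_id", "date", "rate_p_per_kwh"]]

# 2. For each forecast day, take the latest rate effective on or before that day.
#    merge_asof needs both frames sorted by the "on" key.
priced = pd.merge_asof(
    forecast.sort_values("date"),
    banded.sort_values("date"),
    on="date",
    by="meter_id",
    direction="backward",
)

# 3. Daily charge in pounds, then totals per meter.
priced["cost_gbp"] = priced["rate_p_per_kwh"] * priced["kwh"] * 0.01
summary = (
    priced.groupby("meter_id")
    .agg(total_kwh=("kwh", "sum"), total_cost_gbp=("cost_gbp", "sum"))
    .reset_index()
)
print(summary)
```

With these toy numbers, meter 1 is charged 25p on 31 January and 50p on 1 February (£0.75 in total), and meter 2 is charged 40p and 80p (£1.20 in total), because each meter picks up the new rate for its own band on 1 February.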

Write a function that generates a list of random meters of any size. Examples of valid exit zones can be found in the rate table. You may randomly generate the annual quantity.

In [20]:
import numpy as np

# Generate list of random meters of any size
"""
I - An integer number of random meters to generate, the valid exit zones, and an AQ range.
O - A DataFrame with columns meter_id, exit_zone, annual_quantity
C - The number of meters is non-negative, exit zones are valid, AQ lies within the given range.
E - zero or negative size, extreme values, invalid exit zones, AQ range, duplicate meter_ids
"""

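def gen_meter_list(m, ez, aq_min=0, aq_max=5000):
    # m must be a non-negative integer
    if not isinstance(m, (int, np.integer)) or m < 0:
        raise ValueError("Number of meters must be a non-negative integer.")

    rng = np.random.default_rng()
    # sample without replacement so that meter_ids are unique (range 1 to 9,999,999)
    meter_ids = rng.choice(9_999_999, size=m, replace=False) + 1
    exit_zones = rng.choice(ez, size=m)  # select random exit zones
    aq = rng.uniform(aq_min, aq_max, size=m)  # random annual quantity

    df_meters = pd.DataFrame({
        'meter_id': meter_ids,
        'exit_zone': exit_zones,
        'annual_quantity': aq
    })
    return df_meters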

ez = df_data['exit_zone'].unique() # this is just to get only unique exit zones in existing data.

gen_meter_list(10,ez)

   meter_id exit_zone  annual_quantity
0   1223080       EA1      4770.855075
1   2147194       SO1      4300.476778
2   6093783       NT1      2929.171984
3    905181       NT1       544.552281
4   4045470       EA1      1402.572638
5   9541229       SO1      4029.497810
6   9300384       SO1      3650.121820
7   6275324       NT1      2294.263312
8   9588418       NT1       681.447413
9    630677       NT1      2395.864377


Write a function that generates mock consumption data given a list of meters and a start date and duration (number of days in the forecast). The data may be completely random and it doesn't have to match with the meters' annual quantities either.

In [65]:
# Generate mock consumption data
"""
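I - List of meters (DataFrame or dict with a meter_id column), start date (str), duration (int, number of days in the forecast)
O - Consumption data (DataFrame) - meter_id, date, kwh
C - start_date must be a string in YYYY-MM-DD format
E - a non-string start date raises TypeError; an unparseable date string raises ValueError from pd.Timestamp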
"""

def gen_consumption_list(meter_list, start_date, periods):
    # make sure the start date is a string, then convert it to a pandas Timestamp
    if isinstance(start_date, str):
        start_date = pd.Timestamp(start_date)
    else:
        raise TypeError("Start date must be a string in YYYY-MM-DD format.")

    # one forecast date per day for the requested duration
    dates = pd.date_range(start=start_date, periods=periods, freq='D')

    # meter_ids are repeated once per day below and the dates are tiled once per meter,
    # so row i pairs meter i // periods with day i % periods
    meters = np.asarray(meter_list['meter_id'])
    dates = np.tile(dates, len(meters))
    # random daily consumption; the 0-5000 kWh range is arbitrary and not tied to each meter's AQ
    consumption = np.random.uniform(0, 5000, size=len(meters) * periods)

    # get consumption data
    df_consumption = pd.DataFrame({
        'meter_id': np.repeat(meters,periods),
        'date': dates,
        'kwh': consumption
    })
    
    return df_consumption

random_meter_list = gen_meter_list(10,ez)
print(gen_consumption_list(random_meter_list, '2024-01-01', 30))
    

     meter_id       date          kwh
0      911657 2024-01-01  3866.754500
1      911657 2024-01-02   109.183014
2      911657 2024-01-03  4195.311913
3      911657 2024-01-04  4888.334508
4      911657 2024-01-05  2131.522076
..        ...        ...          ...
295   1715086 2024-01-26  4800.992004
296   1715086 2024-01-27  1939.701367
297   1715086 2024-01-28  4395.142058
298   1715086 2024-01-29  2759.023101
299   1715086 2024-01-30  2749.087131

[300 rows x 3 columns]


Write a function that takes as an input a meter list and a consumption forecast table and that calculates the transportation cost table (i.e., best take your logic from task 1 and wrap it in a function). Benchmark this function using meter lists of different sizes and consumption forecasts for periods of different lengths.

In [None]:
# Function to calculate transportation cost table & Benchmark
"""
I - Meter list, consumption forecast table (dict)
O - Transportation cost table 
C -
E -
"""
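
A minimal sketch of what this cell could contain is given below; it is one possible approach, not a reference solution. It assumes the rate table has to be passed in alongside the meter list and the forecast, that each rate row carries an effective date and an AQ band, and that the columns are named `aq_kwh`, `aq_min_kwh` and `aq_max_kwh`. The helper `make_inputs` and all generated values are hypothetical.

```python
import time

import numpy as np
import pandas as pd


def transportation_cost_table(df_meter, df_forecast, df_rate):
    """Return total forecast consumption (kWh) and cost (GBP) per meter."""
    # Keep, per meter, only the rate rows whose AQ band contains the meter's AQ.
    banded = df_meter.merge(df_rate, on="exit_zone")
    in_band = (banded["aq_kwh"] >= banded["aq_min_kwh"]) & (banded["aq_kwh"] < banded["aq_max_kwh"])
    banded = banded.loc[in_band, ["meter_id", "date", "rate_p_per_kwh"]]

    # For every forecast day, take the latest rate effective on or before it.
    priced = pd.merge_asof(
        df_forecast.sort_values("date"),
        banded.sort_values("date"),
        on="date",
        by="meter_id",
    )
    priced["cost_gbp"] = priced["rate_p_per_kwh"] * priced["kwh"] * 0.01
    return (
        priced.groupby("meter_id")
        .agg(total_kwh=("kwh", "sum"), total_cost_gbp=("cost_gbp", "sum"))
        .reset_index()
    )


def make_inputs(n_meters, n_days, rng):
    """Build synthetic meter, forecast and rate tables (all values are made up)."""
    zones = np.array(["EA1", "NT1", "SO1"])
    df_meter = pd.DataFrame({
        "meter_id": np.arange(n_meters),
        "exit_zone": rng.choice(zones, size=n_meters),
        "aq_kwh": rng.uniform(0, 500_000, size=n_meters),
    })
    dates = pd.date_range("2024-01-01", periods=n_days, freq="D")
    df_forecast = pd.DataFrame({
        "meter_id": np.repeat(df_meter["meter_id"].to_numpy(), n_days),
        "date": np.tile(dates, n_meters),
        "kwh": rng.uniform(0, 5000, size=n_meters * n_days),
    })
    # Two arbitrary AQ bands and two effective dates for every exit zone.
    bands = pd.DataFrame({"aq_min_kwh": [0.0, 73_200.0], "aq_max_kwh": [73_200.0, np.inf]})
    effective = pd.DataFrame({"date": pd.date_range("2024-01-01", periods=2, freq="90D")})
    df_rate = pd.DataFrame({"exit_zone": zones}).merge(effective, how="cross").merge(bands, how="cross")
    df_rate["rate_p_per_kwh"] = rng.uniform(0.1, 0.5, size=len(df_rate))
    return df_meter, df_forecast, df_rate


# Benchmark: time the function for a few meter counts and forecast lengths.
rng = np.random.default_rng(0)
for n_meters, n_days in [(100, 30), (1_000, 30), (1_000, 365)]:
    df_meter, df_forecast, df_rate = make_inputs(n_meters, n_days, rng)
    start = time.perf_counter()
    table = transportation_cost_table(df_meter, df_forecast, df_rate)
    elapsed = time.perf_counter() - start
    print(f"{n_meters:>6} meters x {n_days:>3} days: {elapsed:.4f} s, {len(table)} rows")
```

Timings depend on the machine and the pandas version, so the loop prints them rather than assuming any figure. Repeating each measurement several times (for example with `timeit`) would give more stable numbers than a single `perf_counter` reading.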

* **How does this function scale for larger sets of data?**
* **What are your observations after benchmarking?**
* **Are there any steps in the cost calculation that can be improved?**
* **How would you go about improving the performance of this calculation?**