# Gorilla Data Engineer Assessment

Use **pandas** to calculate a transportation distribution charge for four gas meters in
the United Kingdom. While solving this exercise, focus on efficiency - i.e., use vectorised operations and avoid loops!
> Transportation distribution charges are levied by gas distribution companies for the use of their
lower pressure pipelines; they cover the cost of physically transporting the gas through the
pipeline. This rate is determined depending on a meter's exit zone (gas network region) and its
estimated annual quantity (AQ); and it changes over time.

* The daily charge is calculated by finding the correct rate for each meter and day in the
forecast and multiplying this rate (in p/kWh) with the day's forecast (in kWh).
* Calculate the total cost per meter by summing its daily charges for the full forecast
period and converting to pounds (1p = £0.01).
* Calculate the total consumption per meter by summing its daily consumption
forecast for the full period.

In [5]:
# import sys
# print(sys.executable) 

import pandas as pd

In [6]:
# Daily Charge Calculation
"""
I: Rate, Forecast
    Rate (R) = p/kWh (pence per kWh) / pandas Series
    Forecast (F) = consumption in kWh / pandas Series
O: Cost of consumption per day (in pence) / pandas Series
C: Rate and forecast should be non-negative values
E:
    Zero values give a zero daily charge
    Very large values or floating-point arithmetic may cause overflow or loss of precision
    Non-numeric inputs should raise an error
    Negative values should be flagged as invalid and raise an error
"""
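def daily_charge(r, f):
    # Check edge cases
    # Inputs must be pandas Series
    if not isinstance(r, pd.Series) or not isinstance(f, pd.Series):
        raise TypeError("Rate and forecast must be pandas Series.")
    # Non-numeric values
    if not pd.api.types.is_numeric_dtype(r) or not pd.api.types.is_numeric_dtype(f):
        raise TypeError("Rate and forecast must be numeric.")
    # Negative values
    if (r < 0).any():
        raise ValueError("Rate must be non-negative.")
    if (f < 0).any():
        raise ValueError("Forecast must be non-negative.")

    # Calculate the daily charge in pence (element-wise, aligned on the index)
    return r * f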

# Tests
try:
    r = pd.Series([15.5, 16.0, 14.8, 15.2, 16.5])
    f = pd.Series([20.0, 25.0, 30.0, 15.0, 10.0])
    print(daily_charge(r,f))
except (ValueError, TypeError) as error:
    print(error)

0    310.0
1    400.0
2    444.0
3    228.0
4    165.0
dtype: float64


In [7]:
# Total Cost per Meter Calculation
"""
I: rate, forecast
O: total cost in pounds / numeric - float or int
"""

def meter_cost(r,f):
    daily_charges = daily_charge(r,f) # calculate the daily charges
    total = daily_charges.sum() # get the sum total of the daily charges - in pence
    return total * 0.01 # convert to pounds

# Tests
try:
    r = pd.Series([15.5, 16.0, 14.8, 15.2, 16.5])
    f = pd.Series([20.0, 25.0, 30.0, 15.0, 10.0])
    print(meter_cost(r,f))
except (TypeError,ValueError) as e:
    print(e)

15.47


In [9]:
# Total Consumption per Meter Calculation
"""
I: Forecast / pandas Series
O: Consumption in kWh / Numeric
C: Must be non-negative.
E: Needs its own input checks because it does not go through daily_charge.
"""

def meter_consumption(f):
    # Input must be a pandas Series
    if not isinstance(f, pd.Series):
        raise TypeError("Forecast must be a pandas Series of numeric values.")
    # Negative values
    if (f < 0).any(): 
        raise ValueError("Forecast must be non-negative.")
    
    # forecast - total consumption 
    return f.sum()

# Tests
try:
    f = pd.Series([20.0,25.0,30.0,15.0,10.0])
    print(meter_consumption(f))
except (TypeError,ValueError) as e:
    print(e)

100.0


In [137]:
# retrieve csv files
df_rate = pd.read_csv("Gorilla Python Assessment/data/rate.csv")
df_forecast = pd.read_csv("Gorilla Python Assessment/data/forecast.csv")
df_meter = pd.read_csv("Gorilla Python Assessment/data/meter.csv")

# TODO: edge cases and constraints are not handled yet, including the AQ band
# (aq_min_kwh, aq_max_kwh) and the date from which each rate applies

# combine all csv files
df_combined = pd.merge(df_meter, df_rate, on='exit_zone')
df_data = pd.merge(df_combined, df_forecast, on=['meter_id','date'], how='inner')

# calculate the daily charge
# I realize this negates all the work I did before :(
df_data['daily_charge'] = df_data['rate_p_per_kwh'] * df_data['kwh'] 

# calculate the total cost, forecast
cost_total = df_data.groupby('meter_id')['daily_charge'].sum() * 0.01
forecast_total = df_data.groupby('meter_id')['kwh'].sum()
# create one table; reset the index because groupby made meter_id the index
df_total = pd.merge(forecast_total, cost_total, on='meter_id').reset_index()

# making it pretty now
columns = {
    'meter_id': 'Meter ID', 
    'kwh': 'Total Estimated Consumption (kWh)', 
    'daily_charge': 'Total Cost (£)'
    }
df_result = df_total.rename(columns=columns).to_string(index=False)

# see the magic
print(df_result)

 Meter ID  Total Estimated Consumption (kWh)  Total Cost (£)
 14676236                         597.980140        2.178067
 34509937                        1513.258628        4.872876
 50264822                        5600.300184       17.815132
 88357331                        8863.566962       28.073736
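
The rate depends on the exit zone, the AQ band, and the date, so each forecast day needs the rate row that was in force on that day for the meter's zone and band. An exact merge on `date` only matches days that coincide with a date in the rate table. A minimal sketch of an as-of lookup with `pd.merge_asof` follows, on toy data. It assumes each rate applies from its `date` until the next row for the same zone and band, that the band is half-open (`aq_min_kwh <= AQ < aq_max_kwh`), and that the meter table stores the AQ in a column named `aq_kwh` (a hypothetical name). `lookup_rates` is a hypothetical helper, and the rates for the second AQ band are made up.

```python
import pandas as pd

# Toy rate table. The first band uses the EA1 rates from the rate table;
# the second band (73200 to 732000 kWh) is made up for illustration.
rates = pd.DataFrame({
    "date": pd.to_datetime(["2020-04-01", "2020-10-01", "2020-04-01", "2020-10-01"]),
    "exit_zone": ["EA1", "EA1", "EA1", "EA1"],
    "aq_min_kwh": [0, 0, 73200, 73200],
    "aq_max_kwh": [73200, 73200, 732000, 732000],
    "rate_p_per_kwh": [0.2652, 0.2970, 0.1980, 0.2218],
})

# `aq_kwh` is an assumed column name for the meter's annual quantity.
meters = pd.DataFrame({
    "meter_id": [1, 2],
    "exit_zone": ["EA1", "EA1"],
    "aq_kwh": [5000, 100000],
})

forecast = pd.DataFrame({
    "meter_id": [1, 1, 2, 2],
    "date": pd.to_datetime(["2020-09-30", "2020-10-01", "2020-09-30", "2020-10-01"]),
    "kwh": [10.0, 20.0, 100.0, 200.0],
})


def lookup_rates(meters, forecast, rates):
    """Attach the applicable rate to every (meter, day) row of the forecast."""
    # 1. Find each meter's AQ band: join on exit zone, keep the band containing the AQ.
    #    Meters whose AQ falls in no band are dropped here.
    bands = rates[["exit_zone", "aq_min_kwh", "aq_max_kwh"]].drop_duplicates()
    m = meters.merge(bands, on="exit_zone")
    m = m[(m["aq_kwh"] >= m["aq_min_kwh"]) & (m["aq_kwh"] < m["aq_max_kwh"])]

    # 2. Attach the meter attributes (zone and band) to the forecast rows.
    f = forecast.merge(m, on="meter_id")

    # 3. As-of join: for each forecast date take the latest rate dated on or
    #    before it, within the same exit zone and AQ band. Both sides must be
    #    sorted by the `on` key.
    return pd.merge_asof(
        f.sort_values("date"),
        rates.sort_values("date"),
        on="date",
        by=["exit_zone", "aq_min_kwh", "aq_max_kwh"],
        direction="backward",
    )


priced = lookup_rates(meters, forecast, rates)
priced["daily_charge_gbp"] = priced["rate_p_per_kwh"] * priced["kwh"] * 0.01

totals = priced.groupby("meter_id").agg(
    total_kwh=("kwh", "sum"),
    total_cost_gbp=("daily_charge_gbp", "sum"),
)
print(totals)
```

Forecast days earlier than the first rate date get no match and end up with a missing rate, so it is worth checking `priced["rate_p_per_kwh"].isna().any()` before summing.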


Write a function that generates a list of random meters of any size. Examples of valid exit zones can be found in the rate table. You may randomly generate the annual quantity.

In [143]:
import numpy as np

# Generate list of random meters of any size
"""
I - An integer specifying the number of random meters to generate.
O - A DataFrame with the columns meter_id, exit_zone and annual_quantity
C - input is non-negative, valid exit zones, aq is adjusted based on specific requirements.
E - zero or negative size, extreme values, invalid exit zones, AQ range, duplicate meter_ids
"""

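def gen_meter_list(m, ez, aq_min=0, aq_max=5000):
    # aq_min/aq_max replace min/max so the built-ins are not shadowed
    if m < 0:
        raise ValueError("Number of meters must be non-negative.")
    rng = np.random.default_rng()
    # Draw unique meter IDs from 1..9,999,999 (sampling without replacement avoids duplicates)
    meter_ids = rng.choice(9_999_999, size=m, replace=False) + 1
    exit_zones = rng.choice(ez, size=m)  # select random exit zones
    aq = rng.uniform(aq_min, aq_max, size=m)  # random annual quantity

    df_meters = pd.DataFrame({
        'meter_id': meter_ids,
        'exit_zone': exit_zones,
        'annual_quantity': aq
    })
    return df_meters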

ez = df_data['exit_zone'].unique()

print(gen_meter_list(10,ez))

['EA1' 'SO1' 'NT1' 'SE2']
   meter_id exit_zone  annual_quantity
0   3558586       SO1      3793.080109
1   3307472       SE2      3155.408249
2   3741315       SE2      3138.070427
3   8021740       EA1      2398.092311
4   7213335       SO1      3121.731770
5   2293107       EA1      3485.766943
6   5419539       NT1       851.667620
7    967700       SE2      2424.881470
8   2213441       NT1      1188.789002
9   2592691       SE2      3698.174543


Write a function that generates mock consumption data given a list of meters and a start date and duration (number of days in the forecast). The data may be completely random and it doesn't have to match with the meters' annual quantities either.

In [None]:
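# Generate mock consumption data
"""
I - Meter list (DataFrame with a meter_id column), start date, duration (number of days in the forecast)
O - DataFrame with one row per meter and day: meter_id, date, kwh
C - duration is a positive integer; consumption is non-negative
E - empty meter list, zero or negative duration, invalid start date
"""
def gen_consumption_list(meter_list, start_date, duration, max_kwh=100.0):
    # max_kwh is an assumed upper bound for the random daily consumption
    if duration <= 0:
        raise ValueError("Duration must be a positive number of days.")
    dates = pd.date_range(start=start_date, periods=duration, freq="D")
    # One row per (meter, day) pair, built without loops
    index = pd.MultiIndex.from_product([meter_list["meter_id"], dates], names=["meter_id", "date"])
    df_consumption = index.to_frame(index=False)
    df_consumption["kwh"] = np.random.uniform(0, max_kwh, size=len(df_consumption))
    return df_consumption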

Write a function that takes as an input a meter list and a consumption forecast table and that calculates the transportation cost table (i.e., best take your logic from task 1 and wrap it in a function). Benchmark this function using meter lists of different sizes and consumption forecasts for periods of different lengths. How does the function scale for larger sets of data?

| date | exit_zone | aq_min_kwh | aq_max_kwh | rate_p_per_kwh |
|---|---|---|---|---|
| 2020-04-01 | EA1 | 0 | 73200 | 0.2652 |
| 2020-10-01 | EA1 | 0 | 73200 | 0.2970 |
| 2021-04-01 | EA1 | 0 | 73200 | 0.3327 |
| 2021-10-01 | EA1 | 0 | 73200 | 0.3726 |
| 2022-04-01 | EA1 | 0 | 73200 | 0.4173 |
| 2022-10-01 | EA1 | 0 | 73200 | 0.4674 |
| 2023-04-01 | EA1 | 0 | 73200 | 0.5235 |
| 2023-10-01 | EA1 | 0 | 73200 | 0.5863 |
| 2024-04-01 | EA1 | 0 | 73200 | 0.6566 |
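
A minimal sketch of how the benchmarking could be set up: time a cost function over a grid of meter counts and forecast lengths with `time.perf_counter` and keep the best of a few repeats. `make_inputs`, `transportation_cost`, and `benchmark` are hypothetical names. The cost function here is a simplified stand-in that joins one flat, made-up rate per exit zone, not the full date and AQ band lookup, so the timings only illustrate the harness.

```python
import time

import numpy as np
import pandas as pd

# Made-up flat rates per exit zone, only so the benchmark has something to join on.
FLAT_RATES = pd.DataFrame({
    "exit_zone": ["EA1", "SO1", "NT1", "SE2"],
    "rate_p_per_kwh": [0.27, 0.30, 0.33, 0.37],
})


def make_inputs(n_meters, n_days, seed=0):
    """Build a synthetic meter list and a daily forecast for every meter."""
    rng = np.random.default_rng(seed)
    meters = pd.DataFrame({
        "meter_id": np.arange(n_meters),
        "exit_zone": rng.choice(FLAT_RATES["exit_zone"].to_numpy(), size=n_meters),
    })
    dates = pd.date_range("2020-04-01", periods=n_days, freq="D")
    forecast = pd.DataFrame({
        # Each meter_id repeated n_days times, paired with the full date range.
        "meter_id": np.repeat(meters["meter_id"].to_numpy(), n_days),
        "date": np.tile(dates.to_numpy(), n_meters),
        "kwh": rng.uniform(0, 100, size=n_meters * n_days),
    })
    return meters, forecast


def transportation_cost(meters, forecast, rates=FLAT_RATES):
    """Simplified cost table: total kWh and total cost in pounds per meter."""
    df = forecast.merge(meters, on="meter_id").merge(rates, on="exit_zone")
    df["cost_gbp"] = df["rate_p_per_kwh"] * df["kwh"] * 0.01
    return df.groupby("meter_id").agg(
        total_kwh=("kwh", "sum"),
        total_cost_gbp=("cost_gbp", "sum"),
    )


def benchmark(func, meter_counts, day_counts, repeats=3):
    """Time func(meters, forecast) for every combination of sizes."""
    rows = []
    # Looping over benchmark configurations is fine; the data work stays vectorised.
    for n_meters in meter_counts:
        for n_days in day_counts:
            meters, forecast = make_inputs(n_meters, n_days)
            timings = []
            for _ in range(repeats):
                start = time.perf_counter()
                func(meters, forecast)
                timings.append(time.perf_counter() - start)
            rows.append({
                "n_meters": n_meters,
                "n_days": n_days,
                "rows": n_meters * n_days,
                "best_seconds": min(timings),
            })
    return pd.DataFrame(rows)


results = benchmark(transportation_cost, meter_counts=[10, 100], day_counts=[30, 365])
print(results)
```

Plotting `best_seconds` against `rows` for larger grids gives a first impression of how the function scales with the number of (meter, day) pairs.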

What are your observations after benchmarking? Are there any steps in the cost calculation that can be improved? How would you go about improving the performance of this calculation?
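
One possible direction, shown as a minimal sketch and not as a benchmark result: `exit_zone` has only a handful of distinct values, so storing it as a pandas `category` and not as Python strings should shrink the merged table. The series below is synthetic, and whether this also speeds up the merges in practice would need measuring.

```python
import numpy as np
import pandas as pd

# Synthetic exit-zone column with few distinct values and many rows.
rng = np.random.default_rng(0)
zones = rng.choice(["EA1", "SO1", "NT1", "SE2"], size=100_000)

as_object = pd.Series(zones, dtype="object")   # one Python string per row
as_category = as_object.astype("category")     # small integer codes plus 4 labels

# deep=True counts the memory of the string objects themselves.
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(f"object: {object_bytes:,} bytes, category: {category_bytes:,} bytes")
```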