# Gorilla Data Engineer Assessment

Use **pandas** to calculate a transportation distribution charge for four gas meters in
the United Kingdom. While solving this exercise, focus on efficiency - i.e., use vectorised operations and avoid loops!
> Transportation distribution charges are levied by gas distribution companies for the use of their
lower pressure pipelines; they cover the cost of physically transporting the gas through the
pipeline. This rate is determined depending on a meter's exit zone (gas network region) and its
estimated annual quantity (AQ); and it changes over time.

* The daily charge is calculated by finding the correct rate for each meter and day in the
forecast and multiplying this rate (in p/kWh) with the day's forecast (in kWh).
* Calculate the total cost per meter by summing its daily charges for the full forecast
period and converting to pounds (1p = £0.01).
* Calculate the total consumption per meter by summing its daily consumption
forecast for the full period.

In [17]:
# import sys
# print(sys.executable) 

import pandas as pd

In [18]:
# Daily Charge Calculation
"""
I: Rate, Forecast
    Rate (R) = p/kWh (pence per kWh) / pandas Series
    Forecast (F) = consumption in kWh / pandas Series
O: Cost of consumption per day (in pence) / pandas Series
C: Rate and forecast must be non-negative
E: 
    Zero values give a zero daily charge
    Very large values or floating-point arithmetic may lose precision
    Non-Series inputs should raise an error
    Negative values should be flagged as invalid and raise an error
"""
def daily_charge(r, f):
    # Check edge cases
    # Inputs must be pandas Series
    if not isinstance(r, pd.Series) or not isinstance(f, pd.Series):
        raise TypeError("Rate and forecast must be pandas Series.")
    # Non-negative values
    if (r < 0).any(): 
        raise ValueError("Rate must be non-negative.")
    if (f < 0).any(): 
        raise ValueError("Forecast must be non-negative.")

    # Calculate
    return r * f

# Tests
try:
    r = pd.Series([15.5, 16.0, 14.8, 15.2, 16.5])
    f = pd.Series([20.0, 25.0, 30.0, 15.0, 10.0])
    print(daily_charge(r,f))
except (ValueError, TypeError) as error:
    print(error)

0    310.0
1    400.0
2    444.0
3    228.0
4    165.0
dtype: float64


In [19]:
# Total Cost per Meter Calculation
"""
I: rate, forecast
O: total cost in pounds / numeric - float or int
"""

def meter_cost(r,f):
    daily_charges = daily_charge(r,f) # calculate the daily charges
    total = daily_charges.sum() # get the sum total of the daily charges - in pence
    return total * 0.01 # convert to pounds

# Tests
try:
    r = pd.Series([15.5, 16.0, 14.8, 15.2, 16.5])
    f = pd.Series([20.0, 25.0, 30.0, 15.0, 10.0])
    print(meter_cost(r,f))
except (TypeError,ValueError) as e:
    print(e)

15.47


In [20]:
# Total Consumption per Meter Calculation
"""
I: Forecast / pandas Series
O: Consumption in kWh / Numeric
C: Must be non-negative.
E: Needs its own edge-case checks, since this does not run through daily_charge.
"""

def meter_consumption(f):
    # Input must be a pandas Series
    if not isinstance(f, pd.Series):
        raise TypeError("Forecast must be a pandas Series of numeric values.")
    # Non-negative values
    if (f < 0).any(): 
        raise ValueError("Forecast must be non-negative.")
    
    # forecast - total consumption 
    return f.sum()

# Tests
try:
    f = pd.Series([20.0,25.0,30.0,15.0,10.0])
    print(meter_consumption(f))
except (TypeError,ValueError) as e:
    print(e)

100.0


In [21]:
# retrieve csv files
df_rate = pd.read_csv("./data/rate.csv")
df_forecast = pd.read_csv("./data/forecast.csv")
df_meter = pd.read_csv("./data/meter.csv")

# TODO: edge cases and constraints still need to be checked here

# combine all csv files
df_combined = pd.merge(df_meter, df_rate, on='exit_zone')
df_data = pd.merge(df_combined, df_forecast, on=['meter_id','date'], how='inner')

# calculate the daily charge
# Note: this vectorised column operation makes the helper functions above redundant
df_data['daily_charge'] = df_data['rate_p_per_kwh'] * df_data['kwh'] 

# calculate the total cost, forecast
cost_total = df_data.groupby('meter_id')['daily_charge'].sum() * 0.01
forecast_total = df_data.groupby('meter_id')['kwh'].sum()

# create one table; reset the index because groupby leaves meter_id as the index
df_total = pd.merge(forecast_total, cost_total, on='meter_id').reset_index()

# making it pretty now
columns = {
    'meter_id': 'Meter ID', 
    'kwh': 'Total Estimated Consumption (kWh)', 
    'daily_charge': 'Total Cost (£)'
    }
df_result = df_total.rename(columns=columns).to_string(index=False)

# see the magic
print(df_result)

 Meter ID  Total Estimated Consumption (kWh)  Total Cost (£)
 14676236                         597.980140        2.178067
 34509937                        1513.258628        4.872876
 50264822                        5600.300184       17.815132
 88357331                        8863.566962       28.073736
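
The task description says the rate depends on the exit zone, the AQ and the date, while the merge above matches on `exit_zone` and an exact `date` only. The following is a minimal sketch of a fuller lookup, not the notebook's method. It assumes the rate table stores an effective date and an AQ band per rate; the `aq_min_kwh` and `aq_max_kwh` column names, the `aq_kwh` column, the zone name and all values are hypothetical toy data.

```python
import pandas as pd

# Toy data: every value and the aq_min_kwh / aq_max_kwh / aq_kwh names are assumptions.
rates = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-01-01", "2024-02-01", "2024-02-01"]),
    "exit_zone": ["ZONE_A"] * 4,
    "aq_min_kwh": [0, 73_200, 0, 73_200],
    "aq_max_kwh": [73_200, 10**9, 73_200, 10**9],
    "rate_p_per_kwh": [0.30, 0.20, 0.40, 0.25],
})
meters = pd.DataFrame({
    "meter_id": [1, 2],
    "exit_zone": ["ZONE_A", "ZONE_A"],
    "aq_kwh": [50_000, 100_000],
})
forecast = pd.DataFrame({
    "meter_id": [1, 1, 2, 2],
    "date": pd.to_datetime(["2024-01-15", "2024-02-15", "2024-01-15", "2024-02-15"]),
    "kwh": [100.0, 200.0, 300.0, 400.0],
})

# 1. Attach the AQ band each meter falls into (small meters x bands join, then filter).
bands = rates[["exit_zone", "aq_min_kwh", "aq_max_kwh"]].drop_duplicates()
meters_banded = meters.merge(bands, on="exit_zone")
in_band = (meters_banded["aq_kwh"] >= meters_banded["aq_min_kwh"]) & (
    meters_banded["aq_kwh"] < meters_banded["aq_max_kwh"]
)
meters_banded = meters_banded[in_band]

# 2. For every forecast day, take the latest rate effective on or before that day.
daily = forecast.merge(meters_banded, on="meter_id").sort_values("date")
daily = pd.merge_asof(
    daily,
    rates.sort_values("date"),
    on="date",
    by=["exit_zone", "aq_min_kwh", "aq_max_kwh"],
    direction="backward",
)

# 3. Daily charge in pence, then totals per meter in kWh and pounds.
daily["cost_gbp"] = daily["rate_p_per_kwh"] * daily["kwh"] * 0.01
summary = (
    daily.groupby("meter_id")
    .agg(total_kwh=("kwh", "sum"), total_cost_gbp=("cost_gbp", "sum"))
    .reset_index()
)
print(summary)
```

`pd.merge_asof` needs both frames sorted on the `on` key, which is why both sides are sorted by `date` first. If the real rate table has one row per calendar day instead, the exact-date merge used in the cell above is sufficient.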


Write a function that generates a list of random meters of any size. Examples of valid exit zones can be found in the rate table. You may randomly generate the annual quantity.

In [24]:
import numpy as np

# Generate list of random meters of any size
"""
I - An integer specifying the number of random meters to generate.
O - A dict containing meter, exit zone, aq 
C - input is non-negative, valid exit zones, aq is adjusted based on specific requirements.
E - zero or negative, extreme values, valid exit zones, range, duplicate meter_ids
"""

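def gen_meter_list(m, ez, aq_min=0, aq_max=5000):
    # aq_min / aq_max bound the randomly generated annual quantity
    rng = np.random.default_rng()
    # draw IDs without replacement so that no meter_id appears twice
    meter_ids = rng.choice(9_999_999, size=m, replace=False) + 1
    exit_zones = rng.choice(ez, size=m) # select random exit zones
    aq = rng.uniform(aq_min, aq_max, size=m) # random annual quantity
    
    df_meters = pd.DataFrame({
        'meter_id': meter_ids,
        'exit_zone': exit_zones,
        'annual_quantity': aq
    })
    return df_meters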

ez = df_rate['exit_zone'].unique() # this is just to get only unique exit zones in existing data.
# meter = gen_meter_list(len(ez),ez)

Write a function that generates mock consumption data given a list of meters and a start date and duration (number of days in the forecast). The data may be completely random and it doesn't have to match with the meters' annual quantities either.

In [27]:
# Generate mock consumption data
"""
I - List of meters (DF,dict), start date(str), duration (num - # days in forecast)
O - Consumption data (DF) - meter, date, kwh
C - meter list must be valid, start date must be valid, the duration or period should be valid number of dates, the consumption data range should be appropriate use case, 
E - meter list is empty, single meters, duration is only 1 day, negative duration, non-df meters
"""

def gen_consumption_list(meter_list, start_date, periods):
    # parse the start date into a pandas Timestamp
    if isinstance(start_date, str):
        start_date = pd.Timestamp(start_date)
    else:
        raise TypeError("start_date must be a string in YYYY-MM-DD format.")

    # generate forecast periods using duration by Day
    dates = pd.date_range(start=start_date, periods=periods, freq='D')

    # print(dates)

    # build one row per (meter, day): each meter ID is repeated, the date range is tiled
    meters = meter_list['meter_id'].to_numpy()
    dates = np.tile(dates, len(meters))
    # TODO: consider deriving the consumption range from each meter's AQ instead of a fixed 0-5000
    consumption = np.random.uniform(0, 5000, size=len(meters) * periods)
    
    # print(meters, len(dates), len(consumption))

    # get consumption data
    df_consumption = pd.DataFrame({
        'meter_id': np.repeat(meters,periods),
        'date': dates,
        'kwh': consumption
    })
    
    return df_consumption
    
gen_consumption_list(df_meter, '2024-01-01', 30)

     meter_id        date          kwh
0    14676236  2024-01-01  4211.423873
1    14676236  2024-01-02  2248.770667
2    14676236  2024-01-03  1975.751180
3    14676236  2024-01-04  4633.294329
4    14676236  2024-01-05  3636.359979
..        ...         ...          ...
115  88357331  2024-01-26   753.587720
116  88357331  2024-01-27  2540.993884
117  88357331  2024-01-28  3479.064034
118  88357331  2024-01-29  4291.794024


Write a function that takes as an input a meter list and a consumption forecast table and that calculates the transportation cost table (i.e., best take your logic from task 1 and wrap it in a function). Benchmark this function using meter lists of different sizes and consumption forecasts for periods of different lengths.

In [29]:
# Function to calculate transportation cost table & Benchmark
"""
I - Meter list (meter, exit zone, aq columns), consumption table (meter_id, date, kwh)
O - Transportation cost table (meter_id, total consumption, total cost)
C - valid meter list, valid consumption forecast, valid rate table
E - empty dataframes, non-unique meters, missing data (i.e. NaN), inconsistent data consumption not in meter, neg values in rates or consumption
"""

    # if not isinstance(r, pd.Series) or not isinstance(f, pd.Series):
    #     raise TypeError("Rate and forecast must be a series.")
    # # Non-negative values
    # if (r < 0).any(): 
    #     raise ValueError("Rate must be non-negative.")
    # if (f < 0).any(): 
    #     raise ValueError("Forecast must be non-negative.")


def calc_cost(meter, consumption):
    # TODO: check edge cases and constraints before merging
        # is the data frame empty?
        # is it a valid consumption table?
        # are all the meters unique in the list?
        # is there any missing data (i.e. NaN)?
        # does the consumption table reference meters that are not in the meter list?
    
    # combine all data from params
    meters_and_consumption = pd.merge(meter, consumption, on='meter_id')

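    # NOTE: placeholder rate. The real rate should be looked up in the rate table
    # (by exit zone, AQ and date); AQ / kWh makes each daily charge collapse to the AQ
    # and yields inf/NaN for days with zero consumption.
    meters_and_consumption['rate_p_per_kwh'] = meters_and_consumption['annual_quantity'] / meters_and_consumption['kwh']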
    
    # calc the daily charge
    meters_and_consumption['daily_charge'] = meters_and_consumption['rate_p_per_kwh'] * meters_and_consumption['kwh']

    # calc total cost and total consumption per meter
    cost_total = meters_and_consumption.groupby('meter_id')['daily_charge'].sum() * 0.01
    consumption_total = meters_and_consumption.groupby('meter_id')['kwh'].sum()

    # create the data frame
    df_total = pd.merge(consumption_total, cost_total, on='meter_id').reset_index()

    # making it pretty
    columns = {
        'meter_id': 'Meter ID', 
        'kwh': 'Total Estimated Consumption (kWh)', 
        'daily_charge': 'Total Cost (£)'
    }

    df_result = df_total.rename(columns=columns).to_string(index=False)
    
    # return transportation_cost
    return df_result

m = gen_meter_list(len(ez),ez)
c = gen_consumption_list(m, '2024-01-01', 30)

# see the magic
print(calc_cost(m, c))

 Meter ID  Total Estimated Consumption (kWh)  Total Cost (£)
   647297                       79416.995452      715.043213
   800102                       69129.243198     1120.313574
   937842                       77173.043453     1352.063474
  1180888                       76045.569051      894.728212
  1207069                       63694.833783      445.303912
  1208053                       64891.675821      914.747708
  1367224                       75333.383380     1139.016492
  1386981                       77197.584054      945.313122
  1549303                       75621.011714      591.547986
  1896221                       78739.987697       22.000586
  2428082                       74795.759945      813.545497
  2843941                       68820.878441     1303.350938
  3040268                       82845.703396      570.629556
  3122952                       68528.347375      824.121358
  3138745                       75800.715061      880.592562
  3184791               
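
The edge cases listed as comments in `calc_cost` could be checked up front. The following is a minimal sketch of such checks; `validate_inputs` is a hypothetical helper that is not part of the notebook, and it only assumes the `meter_id` and `kwh` column names used above. The toy frames are made up for illustration.

```python
import pandas as pd

def validate_inputs(meter: pd.DataFrame, consumption: pd.DataFrame) -> None:
    """Raise early if the inputs violate the constraints listed for calc_cost."""
    if not isinstance(meter, pd.DataFrame) or not isinstance(consumption, pd.DataFrame):
        raise TypeError("meter and consumption must be DataFrames.")
    if meter.empty or consumption.empty:
        raise ValueError("meter and consumption must not be empty.")
    if meter["meter_id"].duplicated().any():
        raise ValueError("meter_id values must be unique in the meter list.")
    if meter.isna().any().any() or consumption.isna().any().any():
        raise ValueError("Inputs must not contain missing values.")
    if (consumption["kwh"] < 0).any():
        raise ValueError("Consumption must be non-negative.")
    # rows whose meter_id does not appear in the meter list
    unknown = ~consumption["meter_id"].isin(meter["meter_id"])
    if unknown.any():
        raise ValueError(f"{unknown.sum()} consumption rows reference unknown meters.")

# Toy inputs (made up) that satisfy every check.
meters_ok = pd.DataFrame({"meter_id": [1, 2], "annual_quantity": [1000.0, 2000.0]})
consumption_ok = pd.DataFrame({"meter_id": [1, 1, 2], "kwh": [10.0, 20.0, 30.0]})
validate_inputs(meters_ok, consumption_ok)  # passes silently
```

All checks are vectorised, so they add one pass over each column without looping over rows.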

In [40]:
import timeit
import pandas as pd
import numpy as np

# creating large sample data for testing
def gen_meter_list (size, seed):
    np.random.seed(seed)
    id = np.arange(1, size + 1)
    aq = np.random.randint(1000,5000,size)
    return pd.DataFrame({
        'meter_id': id,
        'annual_quantity': aq
    })

def gen_consumption_list(meter_df, start_date, days):
    meter_ids = meter_df['meter_id'].values
    date_range = pd.date_range(start=start_date, periods=days)
    consumption_data = {
        # repeat each meter ID `days` times so every meter gets exactly one row per day
        'meter_id': np.repeat(meter_ids, days),
        'date': np.tile(date_range, len(meter_ids)),
        'kwh': np.random.randint(10, 100, size=days * len(meter_ids))
    }
    return pd.DataFrame(consumption_data)

def benchmark(size, seed=42, days=30):
    m = gen_meter_list(size, seed)
    # print (m)
    c = gen_consumption_list(m, '2024-01-01', days)

    # execution time
    exe_time = timeit.timeit(lambda: calc_cost(m,c), number=10)
    avg_time = exe_time / 10

    return avg_time

benchmark(100)

# testing the test
sizes = [100, 1000, 10000, 100000, 500000, 1000000]
results = []

for size in sizes:
    runtime = benchmark(size)
    results.append((size, runtime))
    print(f"Size: {size}, Avg Runtime: {runtime:.4f} seconds")

Size: 100, Avg Runtime: 0.0044 seconds
Size: 1000, Avg Runtime: 0.0167 seconds
Size: 10000, Avg Runtime: 0.1257 seconds
Size: 100000, Avg Runtime: 1.3260 seconds
Size: 500000, Avg Runtime: 7.1258 seconds
Size: 1000000, Avg Runtime: 14.3119 seconds


> **How does this function scale for larger sets of data?**

Both execution time and memory usage grow with the size of the input data. The cost is driven mainly by merging the data frames, the per-row calculations, and the grouping and summing of the data, all of which become resource intensive for larger data sets.
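
One way to get a rough idea of the memory side without `memory_profiler` is pandas' own `DataFrame.memory_usage`. The following is a minimal sketch on synthetic data; `merged_memory_mb` is a hypothetical helper, and it reports the size of the merged table only, not the peak memory of the whole calculation.

```python
import numpy as np
import pandas as pd

def merged_memory_mb(n_meters: int, days: int = 30) -> float:
    """Return the size in MB of the merged meter x day table."""
    meters = pd.DataFrame({
        "meter_id": np.arange(n_meters),
        "annual_quantity": np.full(n_meters, 1000),
    })
    consumption = pd.DataFrame({
        "meter_id": np.repeat(np.arange(n_meters), days),
        "kwh": np.ones(n_meters * days),
    })
    merged = meters.merge(consumption, on="meter_id")
    # deep=True also counts object columns, although this toy table has none
    return merged.memory_usage(deep=True).sum() / 1e6

for n in (100, 1_000, 10_000):
    print(f"{n:>6} meters: {merged_memory_mb(n):.2f} MB")
```

Because the merged table has one row per meter and day, its size should grow in proportion to `n_meters * days`.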

> **What are your observations after benchmarking?**

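The execution time increases roughly linearly with the size of the data set: going from 100,000 to 1,000,000 meters (10x the data) raised the average runtime from 1.33 s to 14.31 s. The function is efficient for smaller data sets, but for larger ones the execution time becomes significant.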
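
A quick way to check the scaling is to divide each measured runtime by the number of meters. The sketch below only reuses the figures printed by the benchmark above; if the time per meter levels off as the size grows, the runtime is approximately linear in the input size.

```python
# Benchmark figures copied from the output above.
sizes = [100, 1_000, 10_000, 100_000, 500_000, 1_000_000]
runtimes = [0.0044, 0.0167, 0.1257, 1.3260, 7.1258, 14.3119]

# Average time per meter in microseconds.
per_meter_us = [runtime / size * 1e6 for size, runtime in zip(sizes, runtimes)]

for size, value in zip(sizes, per_meter_us):
    print(f"{size:>9} meters: {value:6.1f} microseconds per meter")
```

For these figures the cost per meter is highest at 100 meters (about 44 microseconds, where fixed overhead dominates) and settles at roughly 13 to 14 microseconds from 10,000 meters upwards.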

> **Are there any steps in the cost calculation that can be improved?**

My considerations on steps to improve the cost calculation:
* Merging large data frames is time-consuming and memory-intensive.
* Vectorized calculations in pandas significantly improve performance, based on my initial findings.
* Avoiding redundant work and reusing intermediate results reduces computation time.

> **How would you go about improving the performance of this calculation?**

I'd also like to check the memory usage to guide further improvements, but memory_profiler was not working for me (I'm still curious to figure out why).
1. Make sure all operations are vectorized in pandas, and avoid loops for row-wise operations.
2. Merge on indexed join keys to speed up the DataFrame merges.
3. Drop columns that are not needed for the calculation.
4. Avoid redundant work and reuse intermediate results.
5. Possibly evaluate Dask DataFrames as well.
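
As one illustration of avoiding redundant work, the two `groupby` calls and the merge that joins their results could be collapsed into a single aggregation pass. This is a minimal sketch with made-up toy data; `summarise_costs` is a hypothetical helper, and it assumes the `meter_id`, `kwh` and `rate_p_per_kwh` column names used in the notebook.

```python
import pandas as pd

def summarise_costs(daily: pd.DataFrame) -> pd.DataFrame:
    """Aggregate consumption and cost per meter in a single groupby pass."""
    # vectorised daily charge in pence
    daily = daily.assign(daily_charge=daily["rate_p_per_kwh"] * daily["kwh"])
    summary = (
        daily.groupby("meter_id", sort=False)
        .agg(total_kwh=("kwh", "sum"), total_cost_p=("daily_charge", "sum"))
        .reset_index()
    )
    # convert pence to pounds once, on the small per-meter table
    summary["total_cost_gbp"] = summary.pop("total_cost_p") * 0.01
    return summary

# Toy input (made up): one row per meter and day with the rate already attached.
daily = pd.DataFrame({
    "meter_id": [1, 1, 2],
    "kwh": [10.0, 20.0, 30.0],
    "rate_p_per_kwh": [0.5, 0.5, 0.2],
})
summary = summarise_costs(daily)
print(summary)
```

Returning a numeric DataFrame rather than the `to_string` output also keeps string formatting out of the timed code, so a benchmark measures the calculation itself.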