## Pipeline for producing processed EPC and MCS data and merging them into one table

We're using the asf-core-data repo for the processing.

In [24]:
%load_ext autoreload
%autoreload 2

import sys
sys.path.insert(0, "/Users/juliasuter/Documents/repositories/asf_core_data")
import os

from asf_core_data.getters.data_getters import download_core_data
from asf_core_data import generate_and_save_mcs
from asf_core_data import load_preprocessed_epc_data
from asf_core_data.getters.epc import data_batches
from asf_core_data.pipeline.preprocessing import preprocess_epc_data, data_cleaning
from asf_core_data.pipeline.data_joining import merge_install_dates, merge_proc_datasets
from asf_core_data.getters import data_getters
from asf_core_data.config import base_config


The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


#### Processing EPC

Currently, the EPC data is still downloaded and processed locally. In the future, this will be done directly via S3.
For now, we need to download the raw EPC data into our local data folder.

In [2]:
LOCAL_DATA_DIR = '/Users/juliasuter/Documents/ASF_data'

download_core_data('epc_raw', LOCAL_DATA_DIR, batch='newest')

KeyboardInterrupt: 

In [None]:
# Check whether the newest batch shows up as newest in the local data dir
print("Local input dir\n---------------")
print("Available batches:", data_batches.get_all_batch_names(data_path=LOCAL_DATA_DIR, check_folder='inputs'))
print("Newest batch:", data_batches.get_most_recent_batch(data_path=LOCAL_DATA_DIR))
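
To make the "newest batch" check concrete, here is a minimal sketch of how batch folder names could be ordered. It assumes names follow the `YYYY_QN_complete` pattern seen in `2022_Q3_complete`; the helper `newest_batch` is hypothetical and is not necessarily how `data_batches.get_most_recent_batch` works internally.

```python
import re

# Assumed naming pattern, based on the batch name "2022_Q3_complete".
BATCH_PATTERN = re.compile(r"^(\d{4})_Q([1-4])_complete$")


def newest_batch(batch_names):
    """Return the most recent batch name, ignoring names that do not match."""
    dated = []
    for name in batch_names:
        match = BATCH_PATTERN.match(name)
        if match:
            year, quarter = int(match.group(1)), int(match.group(2))
            dated.append(((year, quarter), name))
    if not dated:
        raise ValueError("No batch names match the expected pattern")
    # Tuples compare element-wise, so (2022, 3) sorts after (2021, 4).
    return max(dated)[1]


print(newest_batch(["2021_Q4_complete", "2022_Q3_complete", "2022_Q1_complete"]))
```

If the local folder contains names in another format, the pattern would need adjusting before relying on this ordering.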

In [None]:
# Process new batch of EPC data
epc_full = preprocess_epc_data.load_and_preprocess_epc_data(
    data_path=LOCAL_DATA_DIR, batch="newest", subset='GB',
    reload_raw=True
)

#### Processing MCS

After processing the EPC data, it has to be uploaded to S3 again for further processing. In the future, this will happen automatically.
In order for the following code to work, you should at least upload the following file to the S3 asf-core-data bucket: `LOCAL_DATA_DIR/BATCH_NAME/EPC_GB_preprocessed.csv`

You can do this with a command such as the following in your terminal:

`aws s3 cp EPC_GB_preprocessed.csv s3://asf-core-data/outputs/EPC/preprocessed_data/2022_Q3_complete/`
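
Since the destination folder changes with every batch, it can help to build the command from the batch name rather than typing it by hand. The sketch below only assembles the argument list; the bucket and prefix are taken from the command above, while the function name `build_upload_command` and the example local path are assumptions for illustration.

```python
from pathlib import PurePosixPath

# Bucket and prefix as they appear in the example command above.
BUCKET = "asf-core-data"
PREFIX = PurePosixPath("outputs/EPC/preprocessed_data")


def build_upload_command(local_data_dir, batch_name, filename="EPC_GB_preprocessed.csv"):
    """Return the `aws s3 cp` argument list for one preprocessed EPC file."""
    # Local layout as described above: LOCAL_DATA_DIR/BATCH_NAME/<file>
    source = PurePosixPath(local_data_dir) / batch_name / filename
    destination = f"s3://{BUCKET}/{PREFIX / batch_name}/"
    return ["aws", "s3", "cp", str(source), destination]


# "/path/to/ASF_data" is a placeholder for your own LOCAL_DATA_DIR.
command = build_upload_command("/path/to/ASF_data", "2022_Q3_complete")
print(" ".join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)`, provided the AWS CLI is installed and your credentials are configured.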

Next, we have to process MCS data and join it with EPC. 

In [None]:
# Process MCS data and join it with EPC
generate_and_save_mcs(verbose=True)

#### Merging the EPC and MCS

Finally, we load the EPC data and merge it with the MCS installations data to compute the best approximation of each heat pump installation date.

In [3]:
# Load the processed EPC data 
prep_epc = load_preprocessed_epc_data(data_path=LOCAL_DATA_DIR, version='preprocessed',
                                       #usecols=['UPRN', 'INSPECTION_DATE', 'HP_INSTALLED', 'HP_TYPE'],  # use fewer fields for testing to save time
                                       batch='newest'
                                    )


In [4]:
# Add more precise estimations for heat pump installation dates via MCS data
epc_with_MCS_dates = merge_install_dates.manage_hp_install_dates(
    prep_epc
)

epc_with_MCS_dates.shape

2023-02-28 02:11:48,413 - botocore.credentials - INFO - Found credentials in shared credentials file: ~/.aws/credentials


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["FIRST_HP_MENTION"] = df[identifier].map(dict(first_hp_mention))


(240089, 59)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["MCS_AVAILABLE"] = ~df["HP_INSTALL_DATE"].isna()
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["HAS_HP_AT_SOME_POINT"] = ~df["FIRST_HP_MENTION"].isna()
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["ARTIFICIALLY_DUPL"] = False

(19047896, 64)
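
The repeated pandas message in the output above is a `SettingWithCopyWarning`. It is emitted when a column is assigned on a DataFrame that was itself derived from a slice of another one. The sketch below shows the usual remedy on toy data: take an explicit `.copy()` before assigning. The column names mirror those in the warning text, but the toy values and the filtering step are assumptions, not the code inside `asf_core_data`.

```python
import pandas as pd

epc = pd.DataFrame(
    {
        "UPRN": [1, 2, 3],
        "HP_INSTALL_DATE": pd.to_datetime(["2021-05-01", None, "2022-01-15"]),
    }
)

# An explicit copy of the filtered rows makes it unambiguous that later
# assignments modify a new DataFrame rather than a view of `epc`.
subset = epc[epc["UPRN"] > 1].copy()
subset["MCS_AVAILABLE"] = ~subset["HP_INSTALL_DATE"].isna()

print(subset)
```

The warnings do not stop the pipeline, as the final shape in the output shows, but they are worth resolving in the library so that genuine problems are not lost in the noise.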

The EPC data with enhanced installation dates can then be merged with MCS installation data. This will standardise features such as HP_INSTALLED and HP_TYPE.

In [5]:
epc_mcs_processed = merge_proc_datasets.merge_proc_epc_and_mcs_installations(epc_with_MCS_dates)

EPC (19047896, 65)
MCS (168574, 13)
MCS matched (139748, 13)
MCS unmatched (28826, 13)
Merged with matched (19053780, 75)
Merged with matched and unmatched (19082606, 75)
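
The row counts printed above can be cross-checked with a little arithmetic. The numbers below are copied from that output. The checks confirm that matched and unmatched MCS records add up to the MCS total, and that the unmatched records account for the difference between the two merged tables. The source does not state why the matched merge has more rows than the EPC input; one plausible cause, offered only as an assumption, is that some EPC records match more than one MCS installation.

```python
# Row counts copied from the merge output above.
epc_rows = 19_047_896
mcs_rows = 168_574
mcs_matched = 139_748
mcs_unmatched = 28_826
merged_matched = 19_053_780
merged_all = 19_082_606

# Every MCS record is either matched or unmatched.
assert mcs_matched + mcs_unmatched == mcs_rows

# Unmatched MCS records appear as additional rows in the final table.
assert merged_matched + mcs_unmatched == merged_all

# Rows added by the matched merge relative to the EPC input.
extra_rows = merged_matched - epc_rows
print(extra_rows)  # 5884
```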


In [6]:
epc_mcs_processed.shape

(19082606, 75)

In [8]:
epc_mcs_processed.to_csv("epc_mcs_processed.csv")
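
A note on `to_csv`: by default pandas writes the row index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read again. The toy example below (made-up two-row frame, in-memory buffers instead of files) shows the difference that `index=False` makes. Whether the index is wanted in `epc_mcs_processed.csv` is a choice for the pipeline and is not settled by the source.

```python
import io

import pandas as pd

df = pd.DataFrame({"POSTCODE": ["EH193EP", "EH222LB"]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()
print(cols_with_index)  # ['Unnamed: 0', 'POSTCODE']

# With index=False only the data columns are written.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()
print(cols_without_index)  # ['POSTCODE']
```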

In [17]:
epc_mcs_processed = data_cleaning.reformat_postcode(epc_mcs_processed, postcode_var_name="POSTCODE", white_space="remove")
epc_mcs_processed['POSTCODE'].head()

0    EH193EP
1    EH222LB
2    EH209LD
3    EH222LW
4    EH224NH
Name: POSTCODE, dtype: object
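
To illustrate what this kind of postcode reformatting does, here is a minimal standalone sketch. It is not the implementation of `data_cleaning.reformat_postcode`. Upper-casing and the `"single_space"` option are assumptions added for illustration; only `white_space="remove"` appears in the call above.

```python
import re


def normalise_postcode(postcode, white_space="remove"):
    """Upper-case a UK postcode and remove or standardise its whitespace."""
    compact = re.sub(r"\s+", "", postcode.strip()).upper()
    if white_space == "remove":
        return compact
    if white_space == "single_space":  # hypothetical option name
        # The inward code of a UK postcode is always its last three characters.
        return f"{compact[:-3]} {compact[-3:]}"
    raise ValueError(f"Unknown white_space option: {white_space!r}")


print(normalise_postcode("eh19 3ep"))
print(normalise_postcode("EH193EP", white_space="single_space"))
```

Using one consistent format on both sides matters whenever postcodes serve as a join key, since `EH19 3EP` and `EH193EP` would otherwise not match.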

In [25]:
newest_hist_inst_batch = data_batches.get_latest_hist_installers()

print(newest_hist_inst_batch)

# Load MCS historical installer data
mcs_inst_data = data_getters.load_s3_data(
    base_config.BUCKET_NAME,
    newest_hist_inst_batch,
)

mcs_inst_data.rename(columns={'postcode':'POSTCODE'}, inplace=True)
mcs_inst_data.shape

outputs/MCS/installers/mcs_historical_installers_20230207.csv


In [28]:
mcs_inst_data.head()

Unnamed: 0,company_unique_id,company_name,mcs_certificate_number,certification_body,address_1,address_2,town,county,POSTCODE,latitude,...,solar_pv_certified,wind_turbine_certified,solar_thermal_certified,battery_storage_certified,air_source_hp_certified,ground_water_source_hp_certified,hot_water_hp_certified,exhaust_air_hp_certified,gas_absorbtion_hp_certified,solar_assisted_hp_certified
0,t j galvin plumbing heating engineers,T J Galvin Plumbing & Heating Engineers,1283,MCS,Brandoak House,Stone,Berkeley,Gloucestershire,GL139LA,51.652315,...,False,False,False,False,True,True,False,False,False,False
1,paragon systems scotland,Paragon Systems (Scotland) Ltd,1286,MCS,"The Office, Corbie Cottage",Maryculter,Aberdeen,Aberdeenshire,AB125FT,57.089012,...,False,False,False,False,True,True,False,True,False,False
2,carillion energy services,Carillion Energy Services Limited,1290,MCS,"Partnership House, Regent Farm Road",Gosforth,Newcastle Upon Tyne,Tyne and Wear,NE33AF,55.010499,...,True,False,True,False,True,True,False,True,False,False
3,edwards uk,Edwards UK Ltd t/a Nugenn,1292,MCS,"Suite 2, Cumbria House",Gillwilly Road,Penrith,Cumbria,CA119FF,54.665127,...,False,False,False,False,True,True,False,True,False,False
4,jdk enterprises,JDK Enterprises Ltd t/a Solar Air UK,1294,,6 Hilltop,Stanley Road,Whitstable,Kent,CT54QE,51.346942,...,False,False,True,False,True,True,False,True,False,False
