## Pipeline for producing processed EPC and MCS data and merging them into one table

We're using the asf-core-data repo for the processing.

In [13]:
%load_ext autoreload
%autoreload 2

import os

from asf_core_data import generate_and_save_mcs
from asf_core_data import load_preprocessed_epc_data

from asf_core_data.getters.epc import data_batches
from asf_core_data.getters.data_getters import download_core_data

from asf_core_data.pipeline.preprocessing import preprocess_epc_data
from asf_core_data.pipeline.data_joining import install_date_computation, merge_proc_datasets





#### Processing EPC

Currently, we still handle EPC processing by downloading the data and processing it locally. In the future, this will be done directly via S3.
For now, we need to download the raw EPC data into our local data folder.

In [14]:
LOCAL_DATA_DIR = '/path/to/data/dir'

# Create the local data directory if it does not exist yet
os.makedirs(LOCAL_DATA_DIR, exist_ok=True)

In [15]:
download_core_data('epc_raw', LOCAL_DATA_DIR, batch='newest')

In [16]:
# Check whether the newest batch shows up as newest in the local data dir
print("Local input dir\n---------------")
print("Available batches:", data_batches.get_all_batch_names(data_path=LOCAL_DATA_DIR, check_folder='inputs'))
print("Newest batch:", data_batches.get_most_recent_epc_batch(data_path=LOCAL_DATA_DIR))

Local input dir
---------------
Available batches: ['2021_Q2_0721', '2022_Q1_complete', '2021_Q4_0721', '2020_Q3_1220', '2022_Q3_complete']
Newest batch: 2022_Q3_complete
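
The batch names above appear to follow a `YEAR_QUARTER_suffix` pattern. As a rough sanity check, independent of `data_batches`, the newest batch can be picked by parsing year and quarter from each name. This is a minimal sketch under that naming assumption; `batch_sort_key` and `most_recent_batch` are hypothetical helpers, not part of `asf_core_data`, and may not match how `get_most_recent_epc_batch()` works internally.

```python
import re

# Assumption: batch names start with "<year>_Q<quarter>_", as in the output above.
BATCH_PATTERN = re.compile(r"^(\d{4})_Q([1-4])_")


def batch_sort_key(batch_name):
    """Return (year, quarter) parsed from a name such as '2022_Q3_complete'."""
    match = BATCH_PATTERN.match(batch_name)
    if match is None:
        raise ValueError(f"Unexpected batch name format: {batch_name!r}")
    return int(match.group(1)), int(match.group(2))


def most_recent_batch(batch_names):
    """Pick the batch with the latest (year, quarter) pair."""
    return max(batch_names, key=batch_sort_key)


batches = ['2021_Q2_0721', '2022_Q1_complete', '2021_Q4_0721',
           '2020_Q3_1220', '2022_Q3_complete']
newest = most_recent_batch(batches)
print("Newest batch:", newest)
```

Sorting on the parsed pair instead of the raw string keeps the comparison robust to differing suffixes such as `0721` and `complete`.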


In [None]:
# Process new batch of EPC data
epc_full = preprocess_epc_data.load_and_preprocess_epc_data(
    data_path=LOCAL_DATA_DIR, batch="newest", subset='GB',
    reload_raw=True
)

#### Processing MCS

After the EPC data has been processed, it has to be uploaded to S3 again for further processing. In the future, this will happen automatically.
For the following code to work, you should upload at least the following file to the S3 asf-core-data bucket: `LOCAL_DATA_DIR/outputs/EPC/preprocessed_data/BATCH_NAME/EPC_GB_preprocessed.csv`

You can do this with a command like the following in your terminal:

`aws s3 cp LOCAL_DATA_DIR/outputs/EPC/preprocessed_data/2022_Q3_complete/EPC_GB_preprocessed.csv s3://asf-core-data/outputs/EPC/preprocessed_data/2022_Q3_complete/`
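
To avoid typing the batch name twice, the command can be assembled from the local data dir and the batch name. The sketch below only builds and prints the command string; `build_upload_command` is a hypothetical helper, and the bucket name and folder layout are copied from the command above, not read from `asf_core_data`.

```python
from pathlib import PurePosixPath

# Assumption: bucket and folder layout as in the aws command above.
BUCKET = "asf-core-data"
REL_DIR = PurePosixPath("outputs/EPC/preprocessed_data")
FILE_NAME = "EPC_GB_preprocessed.csv"


def build_upload_command(local_data_dir, batch_name):
    """Return the `aws s3 cp` command for one preprocessed EPC batch."""
    source = PurePosixPath(local_data_dir) / REL_DIR / batch_name / FILE_NAME
    target = f"s3://{BUCKET}/{REL_DIR / batch_name}/"
    return f"aws s3 cp {source} {target}"


command = build_upload_command("/path/to/data/dir", "2022_Q3_complete")
print(command)
```

The printed command can be pasted into a terminal once `/path/to/data/dir` is replaced with the actual local data dir.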


**Note:**
An additional step will be added here or included in `generate_and_save_mcs()`. We will need to process the MCS historical installer data and add the unique installation ID to the MCS installations.

Next, we have to process MCS data and join it with EPC. 

In [18]:
# Process MCS data and join it with EPC
generate_and_save_mcs(verbose=True)

Installations files
inputs/MCS/latest_raw_data/mcs_installations_2021.xlsx
inputs/MCS/latest_raw_data/mcs_installations_2022_q1.xlsx
inputs/MCS/latest_raw_data/mcs_installations_2022_q2.xlsx
inputs/MCS/latest_raw_data/mcs_installations_2022_q3.xlsx
inputs/MCS/latest_raw_data/mcs_installations_2022_q4.xlsx

Installer files
inputs/MCS/latest_raw_data/mcs_installations_2022_q1.xlsx
inputs/MCS/latest_raw_data/mcs_installer_information_2022_04_06.xlsx
inputs/MCS/latest_raw_data/mcs_installer_information_2022_07_25.xlsx
inputs/MCS/latest_raw_data/mcs_installers.xlsx
Number of records before removing duplicates: 178248




Number of records after removing duplicates: 178200
Shape of loaded data: (176524, 32)


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  hps["cluster"].loc[


Saved in S3: /outputs/MCS/mcs_installations_230315.csv
Getting EPC data...


#### Merging the EPC and MCS

Finally, we load the EPC data and merge it with the MCS installations data to compute the best approximation of the heat pump installation date. You can load the data either from 'S3' or from the local data dir; if you have already downloaded it, loading locally is faster.

All these steps are summarised in the function `merging_pipeline()` in `merge_proc_datasets.py`. 

In [None]:
# Load the processed EPC data 
prep_epc = load_preprocessed_epc_data(data_path="S3", version='preprocessed',
                                       #usecols=['UPRN', 'INSPECTION_DATE', 'HP_INSTALLED', 'HP_TYPE'],  # use fewer fields for testing to save time
                                       batch='newest'
                                    )


In [None]:
# Add more precise estimations for heat pump installation dates via MCS data
epc_with_MCS_dates = install_date_computation.compute_hp_install_date(
    prep_epc
)

epc_with_MCS_dates.shape

The EPC data with enhanced installation dates can then be merged with MCS installation data. This will standardise features such as HP_INSTALLED and HP_TYPE.

**Note**: We're excluding two missing features ("company_unique_id", "installer_name") until the final merges are complete. 

In [None]:
epc_mcs_processed = merge_proc_datasets.add_mcs_installations_data(epc_with_MCS_dates, usecols=[
        "UPRN",
        "commission_date",
        "capacity",
        "estimated_annual_generation",
        "flow_temp",
        "tech_type",
        "scop",
        "design",
        "product_name",
        "manufacturer",
        "cost"
    ], verbose=True)
epc_mcs_processed.columns

#### ! This will fail until merge conflicts are resolved !

This section will be tested and revised after the final merges.

In [None]:
# Merge EPC/MCS with MCS installers 
epc_mcs_complete = merge_proc_datasets.add_mcs_installer_data(
    epc_mcs_processed)

epc_mcs_complete.columns

In [None]:
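# Remove whitespace from the postcode field
# NOTE: data_cleaning is not imported at the top of this notebook.
# The import path below is an assumption about the asf_core_data layout.
from asf_core_data.pipeline.preprocessing import data_cleaning
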
epc_mcs_complete = data_cleaning.reformat_postcode(
    epc_mcs_complete, postcode_var_name="POSTCODE", white_space="remove"
)
epc_mcs_complete['POSTCODE'].head()
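
For reference, removing whitespace from postcodes could look roughly like the plain-Python sketch below. `remove_postcode_whitespace` is a hypothetical helper for illustration only and is not the implementation of `data_cleaning.reformat_postcode()`; how missing values should be treated is also an assumption.

```python
def remove_postcode_whitespace(postcode):
    """Drop all whitespace from a postcode string; leave non-strings untouched."""
    if not isinstance(postcode, str):
        return postcode  # assumption: missing values (e.g. None) pass through
    return "".join(postcode.split())


examples = ["SW1A 1AA", " EC1A  1BB ", None]
cleaned = [remove_postcode_whitespace(p) for p in examples]
print(cleaned)
```

A consistent postcode format matters when postcodes are compared or joined across datasets, since "SW1A 1AA" and "SW1A1AA" would otherwise count as different values.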

In [None]:
epc_mcs_combined = merge_proc_datasets.merging_pipeline()