## Pipeline for producing processed EPC and MCS data and merging them into one table

We use the `asf-core-data` repo for the processing.

In [4]:
%load_ext autoreload
%autoreload 2

import os

from asf_core_data.getters.data_getters import download_core_data
from asf_core_data import generate_and_save_mcs
from asf_core_data import load_preprocessed_epc_data
from asf_core_data.getters.epc import data_batches
from asf_core_data.pipeline.preprocessing import preprocess_epc_data, data_cleaning
from asf_core_data.pipeline.data_joining import merge_install_dates, merge_proc_datasets
from asf_core_data.getters import data_getters
from asf_core_data.config import base_config


2023-02-28 11:56:52,876 - botocore.credentials - INFO - Found credentials in shared credentials file: ~/.aws/credentials


#### Processing EPC

Currently, the EPC data is still downloaded and processed locally. In the future, this will be done directly via S3.
For now, we need to download the raw EPC data into our local data folder.

In [21]:
LOCAL_DATA_DIR = '/path/to/dir'

# Create the local data dir if it does not exist yet
os.makedirs(LOCAL_DATA_DIR, exist_ok=True)

In [None]:
download_core_data('epc_raw', LOCAL_DATA_DIR, batch='newest')

In [6]:
# Check whether the newest batch shows up as newest in the local data dir
print("Local input dir\n---------------")
print("Available batches:", data_batches.get_all_batch_names(data_path=LOCAL_DATA_DIR, check_folder='inputs'))
print("Newest batch:", data_batches.get_most_recent_batch(data_path=LOCAL_DATA_DIR))

Local input dir
---------------
Available batches: ['2022_Q3_complete']
Newest batch: 2022_Q3_complete
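
To illustrate what "newest" means here, the following is a minimal sketch of how batch names of the form `YEAR_Qn_complete` could be ordered. It is not taken from `data_batches`; the pattern, the helper names `batch_sort_key` and `most_recent_batch`, and all batch names except `2022_Q3_complete` are assumptions made for illustration.

```python
import re

# Assumed naming pattern, based on the batch name "2022_Q3_complete" shown above.
BATCH_PATTERN = re.compile(r"^(\d{4})_Q([1-4])_complete$")


def batch_sort_key(batch_name):
    """Return (year, quarter) for a batch name, or raise ValueError."""
    match = BATCH_PATTERN.match(batch_name)
    if match is None:
        raise ValueError(f"Unexpected batch name: {batch_name!r}")
    return int(match.group(1)), int(match.group(2))


def most_recent_batch(batch_names):
    """Pick the batch with the latest (year, quarter) pair."""
    return max(batch_names, key=batch_sort_key)


# Hypothetical list of batches; only "2022_Q3_complete" appears in this notebook.
batches = ["2021_Q4_complete", "2022_Q3_complete", "2022_Q1_complete"]
print(most_recent_batch(batches))  # prints 2022_Q3_complete
```

Sorting on a parsed `(year, quarter)` tuple rather than on the raw string keeps the ordering correct even if the naming scheme later gains a prefix or suffix that breaks plain alphabetical order.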


In [7]:
# Process new batch of EPC data
epc_full = preprocess_epc_data.load_and_preprocess_epc_data(
    data_path=LOCAL_DATA_DIR, batch="newest", subset='GB',
    reload_raw=True
)

Saving raw data to /Users/juliasuter/Documents/My_ASF_data/outputs/EPC/preprocessed_data/2022_Q3_complete/EPC_GB_raw.csv



KeyboardInterrupt: 

#### Processing MCS

After processing the EPC data, it has to be uploaded to S3 again for further processing. In the future, this will happen automatically.
For the following code to work, you need to upload at least the following file to the S3 asf-core-data bucket: `LOCAL_DATA_DIR/outputs/EPC/preprocessed_data/BATCH_NAME/EPC_GB_preprocessed.csv`

You can do this with a terminal command such as the following:

`aws s3 cp EPC_GB_preprocessed.csv s3://asf-core-data/outputs/EPC/preprocessed_data/2022_Q3_complete/`
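
If you prefer to stay in Python, the same upload could be sketched with `boto3`. This is only an illustration, not part of `asf-core-data`: the helper names `build_s3_key` and `upload_preprocessed_epc` are hypothetical, and the bucket name and key prefix are copied from the command above. Running the upload requires `boto3` and valid AWS credentials.

```python
from pathlib import PurePosixPath

BUCKET_NAME = "asf-core-data"  # bucket named in the aws command above
S3_EPC_PREFIX = "outputs/EPC/preprocessed_data"  # prefix taken from the aws command above


def build_s3_key(batch_name, filename="EPC_GB_preprocessed.csv"):
    """Build the S3 object key for a preprocessed EPC file (hypothetical helper)."""
    return str(PurePosixPath(S3_EPC_PREFIX) / batch_name / filename)


def upload_preprocessed_epc(local_path, batch_name, bucket=BUCKET_NAME):
    """Upload a local file to S3 (hypothetical helper, needs boto3 and credentials)."""
    import boto3  # imported lazily so build_s3_key works without boto3 installed

    s3_client = boto3.client("s3")
    s3_client.upload_file(str(local_path), bucket, build_s3_key(batch_name))


print(build_s3_key("2022_Q3_complete"))
# prints outputs/EPC/preprocessed_data/2022_Q3_complete/EPC_GB_preprocessed.csv
```

Calling `upload_preprocessed_epc(path_to_csv, "2022_Q3_complete")` would then place the file under the same key as the terminal command.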


**Note:**
An additional step will be added here or included in `generate_and_save_mcs()`. We will need to process the MCS historical installer data and add the unique installation ID to the MCS installations.

Next, we have to process MCS data and join it with EPC. 

In [14]:
# Get MCS and join with EPC
generate_and_save_mcs(verbose=True)

Installations files
inputs/MCS/latest_raw_data/mcs_installations_2021.xlsx
inputs/MCS/latest_raw_data/mcs_installations_2022_q1.xlsx
inputs/MCS/latest_raw_data/mcs_installations_2022_q2.xlsx
inputs/MCS/latest_raw_data/mcs_installations_2022_q3.xlsx

Installer files
inputs/MCS/latest_raw_data/mcs_installations_2022_q1.xlsx
inputs/MCS/latest_raw_data/mcs_installer_information_2022_04_06.xlsx
inputs/MCS/latest_raw_data/mcs_installer_information_2022_07_25.xlsx
inputs/MCS/latest_raw_data/mcs_installers.xlsx
Number of records before removing duplicates: 170275


  concat_installations = pd.concat(installations_dfs)


Number of records after removing duplicates: 170237
Shape of loaded data: (168574, 31)


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  hps["cluster"].loc[


Saved in S3: /outputs/MCS/mcs_installations_230228.csv
Getting EPC data...
Forming a matching...
- Forming an index...
- Forming a comparison...
- Computing a matching...
Joining the data...
After joining:
-----------------
Total records: 267336
Number matched with EPC: 238510


Saved in S3: /outputs/MCS/mcs_installations_epc_full_230228.csv


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  joined_df["last_epc_before_mcs"].iloc[last_epc_before_mcs_indices] = True


Saved in S3: /outputs/MCS/mcs_installations_epc_most_relevant_230228.csv


#### Merging the EPC and MCS

Finally, we load the EPC data and merge it with the MCS installations data to compute the best approximation for a heat pump installation date. You can also load the data from 'S3' instead of the local data dir, but loading a downloaded copy is faster.

All these steps are summarised in the function `merging_pipeline()` in `merge_proc_datasets.py`. 
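
The idea of a "best approximation" can be sketched in plain Python. The rule below (prefer the MCS installation date, otherwise fall back to the first EPC record that mentions a heat pump) is an assumption made for illustration and may not match what `manage_hp_install_dates()` actually does. The function name `approximate_install_date`, the UPRNs, and the dates are all made up.

```python
from datetime import date


def approximate_install_date(mcs_install_date, first_hp_mention):
    """Return (date, source) for one property.

    Assumed rule: use the MCS date if present, else the first EPC
    heat pump mention, else (None, None).
    """
    if mcs_install_date is not None:
        return mcs_install_date, "MCS"
    if first_hp_mention is not None:
        return first_hp_mention, "EPC"
    return None, None


# Made-up records: hypothetical UPRN -> (MCS install date, first EPC heat pump mention)
records = {
    "1001": (date(2021, 5, 3), date(2021, 9, 14)),  # both available: MCS wins
    "1002": (None, date(2019, 2, 20)),              # EPC mention only
    "1003": (None, None),                           # no heat pump evidence
}

for uprn, (mcs_date, epc_date) in records.items():
    print(uprn, approximate_install_date(mcs_date, epc_date))
```

The actual pipeline works on DataFrames rather than dictionaries; the sketch only shows the fallback order between the two sources.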

In [22]:
# Load the processed EPC data 
prep_epc = load_preprocessed_epc_data(
    data_path=LOCAL_DATA_DIR,
    version='preprocessed',
    # usecols=['UPRN', 'INSPECTION_DATE', 'HP_INSTALLED', 'HP_TYPE'],  # use fewer fields for testing to save time
    batch='newest',
)


In [None]:
# Add more precise estimates for heat pump installation dates via MCS data
epc_with_MCS_dates = merge_install_dates.manage_hp_install_dates(
    prep_epc
)

epc_with_MCS_dates.shape

2023-02-28 02:11:48,413 - botocore.credentials - INFO - Found credentials in shared credentials file: ~/.aws/credentials


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["FIRST_HP_MENTION"] = df[identifier].map(dict(first_hp_mention))


(240089, 59)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["MCS_AVAILABLE"] = ~df["HP_INSTALL_DATE"].isna()
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["HAS_HP_AT_SOME_POINT"] = ~df["FIRST_HP_MENTION"].isna()
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["ARTIFICIALLY_DUPL"] = False
A value is trying to be set on a copy of a slice from a DataF

(19047896, 64)

The EPC data with enhanced installation dates can then be merged with the MCS installation data. This standardises features such as `HP_INSTALLED` and `HP_TYPE`.

In [None]:
epc_mcs_processed = merge_proc_datasets.merge_proc_epc_and_mcs_installations(epc_with_MCS_dates, verbose=True)
epc_mcs_processed.shape

NameError: name 'epc_with_MCS_dates' is not defined

Get the historical installer data (to be merged with the rest as a final step).

In [None]:
newest_hist_inst_batch = data_batches.get_latest_hist_installers()

print(newest_hist_inst_batch)

# Load MCS historical installer data
mcs_inst_data = data_getters.load_s3_data(
    base_config.BUCKET_NAME,
    newest_hist_inst_batch,
)
mcs_inst_data.head()

outputs/MCS/installers/mcs_historical_installers_20230207.csv


Unnamed: 0,company_unique_id,company_name,mcs_certificate_number,certification_body,address_1,address_2,town,county,postcode,latitude,...,solar_pv_certified,wind_turbine_certified,solar_thermal_certified,battery_storage_certified,air_source_hp_certified,ground_water_source_hp_certified,hot_water_hp_certified,exhaust_air_hp_certified,gas_absorbtion_hp_certified,solar_assisted_hp_certified
0,t j galvin plumbing heating engineers,T J Galvin Plumbing & Heating Engineers,1283,MCS,Brandoak House,Stone,Berkeley,Gloucestershire,GL139LA,51.652315,...,False,False,False,False,True,True,False,False,False,False
1,paragon systems scotland,Paragon Systems (Scotland) Ltd,1286,MCS,"The Office, Corbie Cottage",Maryculter,Aberdeen,Aberdeenshire,AB125FT,57.089012,...,False,False,False,False,True,True,False,True,False,False
2,carillion energy services,Carillion Energy Services Limited,1290,MCS,"Partnership House, Regent Farm Road",Gosforth,Newcastle Upon Tyne,Tyne and Wear,NE33AF,55.010499,...,True,False,True,False,True,True,False,True,False,False
3,edwards uk,Edwards UK Ltd t/a Nugenn,1292,MCS,"Suite 2, Cumbria House",Gillwilly Road,Penrith,Cumbria,CA119FF,54.665127,...,False,False,False,False,True,True,False,True,False,False
4,jdk enterprises,JDK Enterprises Ltd t/a Solar Air UK,1294,,6 Hilltop,Stanley Road,Whitstable,Kent,CT54QE,51.346942,...,False,False,True,False,True,True,False,True,False,False
