### Imports and setup

In [1]:
from asf_core_data import get_mcs_installations, load_preprocessed_epc_data

import pandas as pd
import numpy as np

In [2]:
# change path to your own local version of EPC data
epc_path = "/Users/chris.williamson/Documents/ASF_data"

In [3]:
mcs = get_mcs_installations("full")

Loading /outputs/MCS/mcs_installations_epc_full_230510.csv from S3


In [4]:
# convert date columns to datetime type
mcs["commission_date"] = pd.to_datetime(mcs["commission_date"])
mcs["INSPECTION_DATE"] = pd.to_datetime(mcs["INSPECTION_DATE"])

In [5]:
# merge installation type columns and filter to domestic installations
mcs["installation_type"] = mcs["installation_type"].fillna(
    mcs["end_user_installation_type"]
)
mcs = mcs.loc[mcs.installation_type == "Domestic"].reset_index(
    drop=True
)
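
The coalesce-then-filter step above can be sketched on a toy frame. The four rows below are invented for illustration; only the two column names come from the MCS data.

```python
import pandas as pd

# Hypothetical toy data: each record has its type in one of the two columns
toy = pd.DataFrame(
    {
        "installation_type": ["Domestic", None, None, "Commercial"],
        "end_user_installation_type": [None, "Domestic", "Commercial", None],
    }
)

# Fill gaps in the first column from the second, then keep domestic rows only
toy["installation_type"] = toy["installation_type"].fillna(
    toy["end_user_installation_type"]
)
domestic = toy.loc[toy["installation_type"] == "Domestic"].reset_index(drop=True)
print(domestic)
```

`fillna` with a Series aligns on the index, so each missing value is filled from the same row of the other column.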

### What proportion of domestic records in the MCS database relate to new-build installations?

In [6]:
# UPRNs of properties that have an EPC labelled as "new dwelling" (not necessarily the first one - see below)
new_dwelling_uprns = mcs.loc[mcs["TRANSACTION_TYPE"] == "new dwelling", "UPRN"]
new_uprns = [uprn for uprn in new_dwelling_uprns if uprn != "unknown"]

In [7]:
# filter to first records
# records not linked to an EPC are kept
first_records = (
    mcs
    .sort_values("INSPECTION_DATE")
    .groupby("original_mcs_index")
    .head(1)
    .sort_values("original_mcs_index")
)
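
A minimal sketch of why unlinked records survive this filter, using invented rows: `sort_values` places missing dates last by default, so `head(1)` still returns a row for an installation whose only record has no inspection date.

```python
import pandas as pd

# Hypothetical toy data: installation 1 has no linked EPC (missing date)
toy = pd.DataFrame(
    {
        "original_mcs_index": [0, 0, 1, 2, 2],
        "INSPECTION_DATE": pd.to_datetime(
            ["2020-05-01", "2018-03-01", None, "2021-01-01", "2019-06-01"]
        ),
    }
)

# Earliest inspection per installation; NaT rows sort last but are not dropped
first = (
    toy.sort_values("INSPECTION_DATE")
    .groupby("original_mcs_index")
    .head(1)
    .sort_values("original_mcs_index")
)
print(first)
```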

In [8]:
# find number of days between first recorded EPC inspection and HP commission
# could instead use the first EPC that labels the property as a "new dwelling", rather than the first overall - which is best? (see below)
first_records["diff_epc_to_mcs"] = (
    first_records["commission_date"] - first_records["INSPECTION_DATE"]
).dt.days

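# assume dwelling was built with HP if:
# - it has an EPC indicating that it is a new dwelling
# - HP commission date is less than 365 days after the first EPC inspection
#   (negative differences, i.e. inspections after commissioning, also pass;
#   records with no linked EPC have a missing difference and are flagged False)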
first_records["assumed_hp_when_built"] = (
    first_records["UPRN"].isin(new_uprns)
) & (first_records["diff_epc_to_mcs"] < 365)
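
The rule can be checked on a few invented rows covering the edge cases: a short gap, a long gap, an inspection after commissioning, and a record with no linked EPC. The dates and UPRNs are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, one row per installation
toy = pd.DataFrame(
    {
        "UPRN": ["1", "2", "3", "4"],
        "commission_date": pd.to_datetime(
            ["2020-06-01", "2022-06-01", "2020-01-01", "2020-06-01"]
        ),
        "INSPECTION_DATE": pd.to_datetime(
            ["2020-03-01", "2020-03-01", "2020-03-01", None]
        ),
    }
)
new_uprns = ["1", "2", "3"]  # UPRNs assumed to have a "new dwelling" EPC

toy["diff_epc_to_mcs"] = (toy["commission_date"] - toy["INSPECTION_DATE"]).dt.days
toy["assumed_hp_when_built"] = toy["UPRN"].isin(new_uprns) & (
    toy["diff_epc_to_mcs"] < 365
)
print(toy[["UPRN", "diff_epc_to_mcs", "assumed_hp_when_built"]])
```

UPRN 3 is flagged even though its inspection postdates the commission date, because a negative difference is below 365. UPRN 4 is not flagged, because comparing a missing difference evaluates to False.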

Proportion of domestic MCS installations that relate to new builds:

In [9]:
first_records.assumed_hp_when_built.value_counts(normalize=True)

False    0.886784
True     0.113216
Name: assumed_hp_when_built, dtype: float64

Raw numbers:

In [10]:
first_records.assumed_hp_when_built.value_counts()

False    105114
True      13420
Name: assumed_hp_when_built, dtype: int64

Top 5 installers of domestic new build installations:

In [None]:
first_records.loc[first_records["assumed_hp_when_built"]]["installer_name"].value_counts().head()

Difference in average costs for domestic retrofits and new builds:

In [11]:
first_records.groupby("assumed_hp_when_built").cost.mean()

assumed_hp_when_built
False    12593.874442
True     11755.147494
Name: cost, dtype: float64

### What proportion of properties in the EPC database that were built with a HP appear in the MCS database?

In [12]:
epc = load_preprocessed_epc_data(
    epc_path,
    version="preprocessed",
    usecols=["UPRN", "TRANSACTION_TYPE", "INSPECTION_DATE", "HP_INSTALLED"],
)

In [13]:
# filter to records of new builds with a heat pump
# take an explicit copy so that columns can be safely assigned to new_hp below
new_hp = epc.loc[(epc["TRANSACTION_TYPE"] == "new dwelling") & (epc["HP_INSTALLED"])].copy()

In [14]:
# replace missing or unknown UPRNs with dataset-specific codes so that they cannot match each other across the two datasets
new_hp["UPRN"] = new_hp["UPRN"].replace("unknown", "unknown_epc").fillna("unknown_epc").astype("str")
mcs["UPRN"] = mcs["UPRN"].replace("unknown", "unknown_mcs").fillna("unknown_mcs").astype("str")


In [15]:
new_hp["in_mcs"] = new_hp["UPRN"].isin(mcs["UPRN"])


In [16]:
new_hp = new_hp.drop_duplicates("UPRN")
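
The matching steps above (dataset-specific placeholder codes, membership test, de-duplication) can be sketched on invented frames. Calling `.copy()` on the filtered frame is an assumption about how to avoid pandas' chained-assignment warning when adding columns to a filtered subset. All values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with placeholder and missing UPRNs in both datasets
epc_toy = pd.DataFrame(
    {
        "UPRN": ["100", "100", "unknown", None, "200"],
        "HP_INSTALLED": [True, True, True, True, False],
    }
)
mcs_toy = pd.DataFrame({"UPRN": ["100", "unknown", None]})

# Explicit copy, so assigning new columns does not touch a slice of epc_toy
new_hp_toy = epc_toy.loc[epc_toy["HP_INSTALLED"]].copy()

# Different codes per dataset stop "unknown" from matching "unknown"
new_hp_toy["UPRN"] = (
    new_hp_toy["UPRN"].replace("unknown", "unknown_epc").fillna("unknown_epc").astype("str")
)
mcs_toy["UPRN"] = (
    mcs_toy["UPRN"].replace("unknown", "unknown_mcs").fillna("unknown_mcs").astype("str")
)

new_hp_toy["in_mcs"] = new_hp_toy["UPRN"].isin(mcs_toy["UPRN"])
new_hp_toy = new_hp_toy.drop_duplicates("UPRN")
print(new_hp_toy)
```

Note that `drop_duplicates("UPRN")` also collapses every record sharing the `unknown_epc` code into a single row, so properties without a UPRN are counted once in total.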

Proportion of EPC new builds with a HP that are in the MCS database:

In [17]:
new_hp["in_mcs"].value_counts(normalize=True)

False    0.834935
True     0.165065
Name: in_mcs, dtype: float64

Raw numbers:

In [18]:
new_hp["in_mcs"].value_counts()

False    89485
True     17691
Name: in_mcs, dtype: int64

Oddly, more properties are identified here (17,691) than by starting with the MCS dataset (13,420) - how can this happen?

In [19]:
assumed_hp_when_built_uprns = first_records.loc[first_records["assumed_hp_when_built"]]["UPRN"]

weird = new_hp.loc[(new_hp["in_mcs"]) & (~new_hp["UPRN"].isin(assumed_hp_when_built_uprns))]

In [None]:
mcs.loc[mcs["UPRN"].isin(weird["UPRN"])][["UPRN", "commission_date", "TRANSACTION_TYPE", "INSPECTION_DATE"]].sort_values(["UPRN", "INSPECTION_DATE"]).head(20)

In [None]:
epc.loc[epc["UPRN"].isin(weird["UPRN"])].sort_values(["UPRN", "INSPECTION_DATE"]).head(20)

These properties seem to be ones with an EPC that says that the property is a "new dwelling" and has a HP, and either:
* an earlier certificate with an inspection date >1 year before the MCS commission date, or
* an MCS commission date that is >1 year after the "new dwelling" EPC inspection date

The former could indicate errors in the EPC dataset (why would a new dwelling have a previous certificate?).

The latter could be properties getting their HP replaced or getting a second HP installed, or errors in the MCS dataset.
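
One way to separate the two cases, sketched on invented rows, is to check whether a property's earliest certificate is itself the "new dwelling" one. The frame below mimics the joined MCS-EPC layout; the labels `earlier certificate` and `late commission` are hypothetical names for the two bullets above.

```python
import pandas as pd

# Hypothetical toy data: A has a certificate predating its "new dwelling" EPC,
# B was commissioned long after its (first) "new dwelling" EPC
toy = pd.DataFrame(
    {
        "UPRN": ["A", "A", "B"],
        "TRANSACTION_TYPE": ["rental", "new dwelling", "new dwelling"],
        "INSPECTION_DATE": pd.to_datetime(["2015-01-01", "2019-01-01", "2018-01-01"]),
        "commission_date": pd.to_datetime(["2019-02-01", "2019-02-01", "2021-01-01"]),
    }
)

# Earliest certificate per property
earliest = (
    toy.sort_values("INSPECTION_DATE").groupby("UPRN").head(1).set_index("UPRN")
)
gap_days = (earliest["commission_date"] - earliest["INSPECTION_DATE"]).dt.days
case = (
    earliest["TRANSACTION_TYPE"]
    .eq("new dwelling")
    .map({True: "late commission", False: "earlier certificate"})
)
summary = pd.DataFrame({"gap_days": gap_days, "case": case})
print(summary)
```

In both cases the gap measured from the earliest certificate is at least 365 days, which is why neither property passes the `assumed_hp_when_built` rule.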

### Side note: How many "new dwelling" EPCs have an earlier certificate?

In [20]:
epc["certificate_n"] = epc.sort_values("INSPECTION_DATE").groupby("UPRN").cumcount()
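
A small sketch of how this numbering behaves, on invented rows: `cumcount` numbers certificates from 0 within each UPRN in date order, and the result is aligned back to the unsorted frame by index.

```python
import pandas as pd

# Hypothetical toy data, deliberately not in date order
toy = pd.DataFrame(
    {
        "UPRN": ["A", "B", "A", "A"],
        "INSPECTION_DATE": pd.to_datetime(
            ["2019-01-01", "2020-01-01", "2015-01-01", "2021-01-01"]
        ),
    }
)

# 0 = earliest certificate for the property, 1 = second earliest, and so on
toy["certificate_n"] = (
    toy.sort_values("INSPECTION_DATE").groupby("UPRN").cumcount()
)
print(toy)
```

Because grouping is by UPRN alone, all records that share a placeholder UPRN such as "unknown" are numbered as if they were one property, which is why the later cells exclude them.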

In [21]:
epc.loc[(epc["UPRN"] != "unknown") & (epc["TRANSACTION_TYPE"] == "new dwelling")]["certificate_n"].value_counts()

0      2605801
1       140985
2        16907
3         2987
4          760
        ...   
256          1
253          1
273          1
271          1
267          1
Name: certificate_n, Length: 300, dtype: int64

Proportion of "new dwelling" EPCs that have an earlier certificate:

In [22]:
(epc.loc[(epc["UPRN"] != "unknown") & (epc["TRANSACTION_TYPE"] == "new dwelling")]["certificate_n"] >= 1).value_counts(normalize=True)

False    0.940835
True     0.059165
Name: certificate_n, dtype: float64

One extreme example:

In [None]:
epc.loc[epc["certificate_n"] == 273]

This one property has 300 "new dwelling" certificates:

In [None]:
epc.loc[(epc["UPRN"] == "384131") & (epc["TRANSACTION_TYPE"] == "new dwelling")]

Transaction types of first records for properties with a "new dwelling" certificate and an earlier certificate:

In [23]:
bad_uprns = epc.loc[(epc["UPRN"] != "unknown") & (epc["TRANSACTION_TYPE"] == "new dwelling") & (epc["certificate_n"] >= 1)]["UPRN"]

epc.loc[(epc["UPRN"].isin(bad_uprns)) & (epc["certificate_n"] == 0)]["TRANSACTION_TYPE"].value_counts(normalize=True)

new dwelling                 0.535386
marketed sale                0.257529
rental                       0.166368
unknown                      0.012817
non marketed sale            0.008973
FiT application              0.007918
assessment for green deal    0.004408
ECO assessment               0.003275
RHI application              0.002417
not sale or rental           0.000544
Stock Condition Survey       0.000242
following green deal         0.000124
Name: TRANSACTION_TYPE, dtype: float64

## Summary: to report to MCS

* About 11% of domestic installation records in the MCS dataset relate to new builds (13,420 records).

* The biggest installer of new build HPs is ... with ... installations. The second and third biggest installers of new build HPs are ... (... installations) and ... (... installations).

* The average reported cost of a new build installation is about £840 less than a retrofit (£11,760 compared to £12,600).

* About 83% of properties in the EPC dataset which are labelled as being new dwellings with a heat pump are not found in the MCS dataset (89,485 properties).

* Some "new dwellings with a heat pump" in the EPC dataset appear in the MCS dataset with a commissioning date that is more than 1 year after the EPC inspection date. These could be properties getting their original heat pump replaced or getting a second heat pump installed, but they could also indicate errors in the MCS dataset.
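
The headline figures above can be reproduced from the raw counts and mean costs printed earlier in the notebook. This is only an arithmetic check; the variable names are illustrative.

```python
# Counts and mean costs copied from the cell outputs above
n_new_build, n_other = 13420, 105114
n_in_mcs, n_not_in_mcs = 17691, 89485
mean_cost_other, mean_cost_new_build = 12593.874442, 11755.147494

share_new_build = n_new_build / (n_new_build + n_other)
share_not_in_mcs = n_not_in_mcs / (n_in_mcs + n_not_in_mcs)
cost_gap = mean_cost_other - mean_cost_new_build

print(f"New-build share of domestic MCS records: {share_new_build:.1%}")
print(f"EPC new builds with a HP not found in MCS: {share_not_in_mcs:.1%}")
print(f"Mean cost gap (retrofit minus new build): £{cost_gap:,.0f}")
```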