# **05 Merge EPC data**

## Objectives

* Merge the clean EPC data with the combined Price Paid, Geography and IMD data (ppd_with_geography_and_imd) to create and save the final dataset for analysis.

## Inputs

* data/clean/ppd_with_geography_and_imd.csv
* data/clean/epc_master.zip

## Outputs

* data/clean/ppd_with_geography_and_imd_epc.csv

## Additional Comments

* The EPC data adds the floor area and EPC rating features.
* The EPC data is stored as a zip file because the uncompressed file would exceed GitHub's 100 MB file size limit.
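
As a minimal sketch of how a zipped CSV like this could be written and read back with pandas, the example below round-trips a toy frame through a temporary zip archive. The toy rows are copied from the EPC preview, and the archive member name `epc_master.csv` is an assumption, not something taken from the project. Writing with `index=False` is also assumed here; it avoids the extra `Unnamed: 0` column when the file is reloaded.

```python
import os
import tempfile

import pandas as pd

# Toy frame with the same columns as the EPC extract (values copied from the preview)
toy_epc = pd.DataFrame({
    "postcode": ["B31 1UQ", "CV4 9WF"],
    "address": ["47 Leyhill Farm Road", "4, Niagara Close"],
    "current_energy_rating": ["B", "B"],
    "total_floor_area": [60.96, 47.94],
    "lodgement_date": ["2008-10-01", "2008-10-01"],
})

with tempfile.TemporaryDirectory() as tmp:
    zip_path = os.path.join(tmp, "epc_master.zip")
    # Write a single CSV inside a zip archive; the member name is an assumption
    toy_epc.to_csv(
        zip_path,
        index=False,
        compression={"method": "zip", "archive_name": "epc_master.csv"},
    )
    # pandas infers zip compression from the .zip extension
    reloaded = pd.read_csv(zip_path)

print(reloaded.columns.tolist())
```

Passing `compression="zip"` explicitly to `pd.read_csv`, as the notebook does, has the same effect as relying on the file extension.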

---

# Read input files

In [None]:
import pandas as pd

# pandas reads the single CSV inside the zip archive directly
epc = pd.read_csv("../data/clean/epc_master.zip", compression="zip", low_memory=False)

In [4]:
epc.head()

Unnamed: 0,postcode,address,current_energy_rating,total_floor_area,lodgement_date
0,B31 1UQ,47 Leyhill Farm Road,B,60.96,2008-10-01
1,CV4 9WF,"4, Niagara Close",B,47.94,2008-10-01
2,DY1 2HD,"26, Abbotsford Drive",F,139.21,2008-10-01
3,WR9 8UH,"7, Bagehott Road",D,176.94,2008-10-01
4,ST9 9DQ,"9 Waterfall Cottages, Basnetts Wood, Endon",E,131.71,2008-10-01


In [5]:
# Load the Price Paid, Geography and IMD data
ppd_geography_imd = pd.read_csv("../data/clean/ppd_with_geography_and_imd.csv")
ppd_geography_imd.head()

Unnamed: 0,transaction,price,transfer_date,postcode,property_type,new_build,tenure,PAON,SAON,Street,...,town_city,district,county,PPD_category,Status,lsoa11cd,msoa11nm,ladnm,IMD_Decile,IMD_Rank
0,{3DCCB7C9-D239-5B9D-E063-4704A8C0331E},340000,2003-11-28 00:00,DE73 7JQ,S,N,F,COBWEB BARN,,INGLEBY LANE,...,DERBY,SOUTH DERBYSHIRE,DERBYSHIRE,A,A,E01019843,South Derbyshire 006,South Derbyshire,7.0,20607.0
1,{3DCCB7C9-D364-5B9D-E063-4704A8C0331E},450000,2006-03-17 00:00,DE7 6GU,D,N,F,3,,BEECH LANE,...,ILKESTON,EREWASH,DERBYSHIRE,A,A,E01019703,Erewash 005,Erewash,10.0,29988.0
2,{3DCCB7CA-8C58-5B9D-E063-4704A8C0331E},350000,2001-07-19 00:00,CV4 7PA,D,N,F,1,,THE LAURELS,...,COVENTRY,COVENTRY,WEST MIDLANDS,A,A,E01009665,Coventry 042,Coventry,9.0,29510.0
3,{3DCCB7CA-1DAB-5B9D-E063-4704A8C0331E},295000,2021-05-21 00:00,LE14 3QL,D,N,F,4,,HOUGHTON CLOSE,...,MELTON MOWBRAY,MELTON,LEICESTERSHIRE,A,A,E01025884,Melton 003,Melton,6.0,16968.0
4,{3DCCB7CA-1EF1-5B9D-E063-4704A8C0331E},600000,2021-12-20 00:00,DE73 8LF,D,Y,F,10,,PRIORY CLOSE,...,DERBY,NORTH WEST LEICESTERSHIRE,LEICESTERSHIRE,A,A,E01025924,North West Leicestershire 001,North West Leicestershire,7.0,19826.0


---

# Merge datasets on postcode and address using fuzzy matching

In [None]:
# Merge the datasets on the postcode and address fields.
# The address formats differ between the two sources, so the addresses are fuzzy matched:
#   - for ppd use the PAON, SAON, Street and postcode fields
#   - for epc use the address and postcode fields
# The rapidfuzz library is used for fuzzy matching (suggested by Copilot).

from rapidfuzz import fuzz, process

def make_addr_key(df, cols):
    """Join the given columns into a lower-case, alphanumeric-only address key."""
    return (
        df[cols]
          .fillna("")
          .astype(str)  # guards against numeric house numbers breaking the join
          .agg(" ".join, axis=1)
          .str.lower()
          .str.replace(r"[^a-z0-9 ]", "", regex=True)
          .str.replace(r"\s+", " ", regex=True)
          .str.strip()
    )

ppd_geography_imd["addr_key"] = make_addr_key(ppd_geography_imd, ["PAON", "SAON", "Street", "postcode"])
epc["addr_key"] = make_addr_key(epc, ["address", "postcode"])

# A property can have several certificates. Keep one row per address key so that
# choices.loc[name] returns a single row (assumption: the most recent lodgement is wanted).
choices = (
    epc.sort_values("lodgement_date")
       .drop_duplicates("addr_key", keep="last")
       .set_index("addr_key")[["current_energy_rating", "total_floor_area"]]
)

def best_match(x):
    # Always return four values so they line up with the four target columns
    if not x:
        return pd.Series([None, None, None, None])
    name, score, _ = process.extractOne(x, choices.index, scorer=fuzz.WRatio)
    if score >= 90:
        return pd.Series([name, score, *choices.loc[name]])
    return pd.Series([None, None, None, None])

ppd_geography_imd[["matched_key", "match_score", "current_energy_rating", "total_floor_area"]] = ppd_geography_imd["addr_key"].apply(best_match)

# test the results
ppd_geography_imd.head()
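
The cell above scores every PPD address against every EPC address, so the work grows with the product of the two table sizes. One common way to reduce this is to block on postcode, so that only addresses sharing a postcode are compared. The sketch below illustrates the idea on toy data rather than the project's method. The function name `match_by_postcode`, the `similarity` wrapper and the toy rows (including the second EPC row) are hypothetical, and postcodes are assumed to be formatted identically in both tables. If rapidfuzz is not installed, the sketch falls back to the standard library's `difflib`, which produces different scores.

```python
import pandas as pd

try:
    from rapidfuzz import fuzz

    def similarity(a, b):
        return fuzz.WRatio(a, b)  # 0-100 scale, same scorer as the cell above
except ImportError:  # stand-in scorer so the sketch still runs without rapidfuzz
    from difflib import SequenceMatcher

    def similarity(a, b):
        return 100 * SequenceMatcher(None, a, b).ratio()

COLUMNS = ["matched_key", "match_score", "current_energy_rating", "total_floor_area"]

def match_by_postcode(ppd, epc, threshold=90):
    """Fuzzy match addr_key, comparing only rows that share a postcode."""
    # Group the EPC rows by postcode once, up front
    epc_by_postcode = {pc: grp for pc, grp in epc.groupby("postcode")}
    results = []
    for pc, key in zip(ppd["postcode"], ppd["addr_key"]):
        best = (None, None, None, None)
        candidates = epc_by_postcode.get(pc)
        if candidates is not None and key:
            scores = candidates["addr_key"].map(lambda c: similarity(key, c))
            top = scores.idxmax()  # label of the highest-scoring candidate
            if scores.loc[top] >= threshold:
                row = candidates.loc[top]
                best = (row["addr_key"], scores.loc[top],
                        row["current_energy_rating"], row["total_floor_area"])
        results.append(best)
    return pd.DataFrame(results, index=ppd.index, columns=COLUMNS)

# Toy data: the first EPC row comes from the preview, the second is made up
epc_toy = pd.DataFrame({
    "postcode": ["CV4 9WF", "CV4 9WF"],
    "addr_key": ["4 niagara close cv4 9wf", "14 niagara close cv4 9wf"],
    "current_energy_rating": ["B", "D"],
    "total_floor_area": [47.94, 80.0],
})
ppd_toy = pd.DataFrame({
    "postcode": ["CV4 9WF", "ZZ1 1ZZ"],
    "addr_key": ["4 niagara close cv4 9wf", "1 example road zz1 1zz"],
})

matches = match_by_postcode(ppd_toy, epc_toy)
print(matches)
```

The second toy row has no EPC candidates in its postcode, so it stays unmatched without any scoring taking place. The trade-off is that a transaction whose postcode is mistyped or formatted differently can never match.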


---

NOTE

* You may add as many sections as you want, as long as they support your project workflow.
* All notebook cells should run top-down: no step should require going back to an earlier cell to perform a task, such as refreshing a variable's contents.

---