# UK Debt Analysis - DMO Data Cleaning

This notebook demonstrates how to clean and process gilt data from the UK Debt Management Office (DMO), applying string functions, numerical transformations, and filtering to produce a clean dataset.

<br><br>

First we load the dataset, instructing `pandas` to skip the first 8 rows, which contain only metadata.

In [120]:
import pandas as pd

# Load the data and skip the first 8 rows
df = pd.read_csv("https://github.com/RDeconomist/RDeconomist.github.io/raw/refs/heads/main/data/GiltsInIssueRaw.csv", encoding='ISO-8859-1', skiprows=8)
df.head()

Unnamed: 0,Conventional Gilts,ISIN Code,Redemption Date,First Issue Date,Dividend Dates,Current/Next \nEx-dividend Date,Total Amount in Issue \n(£ million nominal),Unnamed: 7,Unnamed: 8
0,Ultra-Short,,,,,,,,
1,0 1/8% Treasury Gilt 2023,GB00BL68HG94,31-Jan-2023,08-Apr-2020,31 Jan/Jul,20-Jan-2023,33824.0,,
2,0¾% Treasury Gilt 2023,GB00BF0HZ991,22-Jul-2023,20-Jul-2017,22 Jan/Jul,12-Jan-2023,34359.0,,
3,2¼% Treasury Gilt 2023,GB00B7Z53659,07-Sep-2023,12-Jun-2013,7 Mar/Sep,24-Feb-2023,35922.0,,
4,0 1/8% Treasury Gilt 2024,GB00BMGR2791,31-Jan-2024,07-Oct-2020,31 Jan/Jul,20-Jan-2023,35551.0,,
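
As a minimal sketch of what `skiprows` does, the example below reads a small in-memory CSV instead of the real file. The two metadata lines and the shortened column list are made up for illustration and are not taken from the DMO file.

```python
import io
import pandas as pd

# Hypothetical miniature of a raw file: two metadata lines, then the real header.
raw = io.StringIO(
    "Gilts in issue report\n"
    "Illustrative metadata line\n"
    "Conventional Gilts,ISIN Code,Total Amount\n"
    "0 1/8% Treasury Gilt 2023,GB00BL68HG94,33824\n"
)

# skiprows=2 discards the two metadata lines, so the third line becomes the header.
demo = pd.read_csv(raw, skiprows=2)
print(demo.columns.tolist())
print(demo)
```

If the number of metadata lines in the source file changes, the `skiprows` value has to change with it, so it is worth checking `df.head()` after loading.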


<br><br>
## Step 1: Renaming columns 
We rename the columns to reflect the data they contain.

In [121]:

# Rename columns
df.columns = ['type', 'code', 'redemptionDate', 'issueDate', 'divDates', 'divNextDate', 'issueAmount', 'baseRPI', 'amountINCinfUplift']
df.head()


Unnamed: 0,type,code,redemptionDate,issueDate,divDates,divNextDate,issueAmount,baseRPI,amountINCinfUplift
0,Ultra-Short,,,,,,,,
1,0 1/8% Treasury Gilt 2023,GB00BL68HG94,31-Jan-2023,08-Apr-2020,31 Jan/Jul,20-Jan-2023,33824.0,,
2,0¾% Treasury Gilt 2023,GB00BF0HZ991,22-Jul-2023,20-Jul-2017,22 Jan/Jul,12-Jan-2023,34359.0,,
3,2¼% Treasury Gilt 2023,GB00B7Z53659,07-Sep-2023,12-Jun-2013,7 Mar/Sep,24-Feb-2023,35922.0,,
4,0 1/8% Treasury Gilt 2024,GB00BMGR2791,31-Jan-2024,07-Oct-2020,31 Jan/Jul,20-Jan-2023,35551.0,,
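
Assigning a list to `df.columns` depends on the number and order of columns staying the same. A possible alternative, sketched here on a toy frame with only two of the raw column names, is to rename by name with `DataFrame.rename`, which leaves unlisted columns untouched.

```python
import pandas as pd

# Toy frame using two of the raw column names shown in the output above.
demo = pd.DataFrame({
    "ISIN Code": ["GB00BL68HG94"],
    "Redemption Date": ["31-Jan-2023"],
})

# Map old names to new names explicitly; column order does not matter here.
demo = demo.rename(columns={
    "ISIN Code": "code",
    "Redemption Date": "redemptionDate",
})
print(demo.columns.tolist())
```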


<br><br>
## Step 2: Dropping unnecessary rows
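We filter out rows where `code` holds the literal text "ISIN Code" (repeated header rows) and rows with empty strings in `type`, `code`, or `redemptionDate`. Rows whose values are missing (`NaN`), such as the "Ultra-Short" category row, pass this filter and are removed in Step 5.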

In [122]:

# Drop repeated header rows and rows with empty strings in type, code, or redemptionDate
df = df[(df['type'] != "") & (df['code'] != "ISIN Code") & (df['code'] != "") & (df['redemptionDate'] != "")]
df.head()


Unnamed: 0,type,code,redemptionDate,issueDate,divDates,divNextDate,issueAmount,baseRPI,amountINCinfUplift
0,Ultra-Short,,,,,,,,
1,0 1/8% Treasury Gilt 2023,GB00BL68HG94,31-Jan-2023,08-Apr-2020,31 Jan/Jul,20-Jan-2023,33824.0,,
2,0¾% Treasury Gilt 2023,GB00BF0HZ991,22-Jul-2023,20-Jul-2017,22 Jan/Jul,12-Jan-2023,34359.0,,
3,2¼% Treasury Gilt 2023,GB00B7Z53659,07-Sep-2023,12-Jun-2013,7 Mar/Sep,24-Feb-2023,35922.0,,
4,0 1/8% Treasury Gilt 2024,GB00BMGR2791,31-Jan-2024,07-Oct-2020,31 Jan/Jul,20-Jan-2023,35551.0,,
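
The "Ultra-Short" row survives the filter because a missing value is `NaN`, not an empty string, and `NaN != ""` evaluates to `True`. The toy frame below (made-up rows modelled on the output above) illustrates the difference between comparing against `""` and calling `dropna`.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "type": ["Ultra-Short", "0 1/8% Treasury Gilt 2023", "Conventional Gilts"],
    "code": [np.nan, "GB00BL68HG94", "ISIN Code"],
})

# String comparisons remove the repeated header row but keep the NaN row.
kept_by_comparison = demo[(demo["code"] != "") & (demo["code"] != "ISIN Code")]

# dropna removes rows where the listed columns are actually missing.
kept_by_dropna = kept_by_comparison.dropna(subset=["code"])

print(len(kept_by_comparison), len(kept_by_dropna))
```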


<br><br>
## Step 3: Extracting the Gilt Types
We split at the first 'T' or 'I' to extract the Gilt type.

In [123]:
df['giltType'] = df['type'].str.split(r'(T|I)', n=1) # Split the string starting at 'T' or 'I', keeping everything after it
df['giltType'] = df['giltType'].apply(lambda x: ''.join(x[1:]) if isinstance(x, list) and len(x) > 1 else pd.NA) # Apply logic to return the part from the first occurrence of 'T' or 'I', or NA if not found
df.head()

Unnamed: 0,type,code,redemptionDate,issueDate,divDates,divNextDate,issueAmount,baseRPI,amountINCinfUplift,giltType
0,Ultra-Short,,,,,,,,,
1,0 1/8% Treasury Gilt 2023,GB00BL68HG94,31-Jan-2023,08-Apr-2020,31 Jan/Jul,20-Jan-2023,33824.0,,,Treasury Gilt 2023
2,0¾% Treasury Gilt 2023,GB00BF0HZ991,22-Jul-2023,20-Jul-2017,22 Jan/Jul,12-Jan-2023,34359.0,,,Treasury Gilt 2023
3,2¼% Treasury Gilt 2023,GB00B7Z53659,07-Sep-2023,12-Jun-2013,7 Mar/Sep,24-Feb-2023,35922.0,,,Treasury Gilt 2023
4,0 1/8% Treasury Gilt 2024,GB00BMGR2791,31-Jan-2024,07-Oct-2020,31 Jan/Jul,20-Jan-2023,35551.0,,,Treasury Gilt 2024
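
A possible one-step alternative to splitting and re-joining is `Series.str.extract` with a capture group that starts at the first 'T' or 'I'. This is a sketch on sample names from the tables in this notebook, not the method used in the cell above.

```python
import pandas as pd

names = pd.Series([
    "0 1/8% Treasury Gilt 2023",
    "1 7/8% Index-linked Treasury Gilt 2022",
    "Ultra-Short",
])

# Capture everything from the first capital 'T' or 'I' to the end of the string.
# Rows without a match (such as the category label) come back as NaN.
gilt_type = names.str.extract(r"([TI].*)", expand=False)
print(gilt_type)
```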


<br><br>
## Step 4: Extracting Coupon Information using String Functions
We clean the Coupon info by considering only the text before the first 'T' or 'I' and decimalise fractions.

In [124]:


# Split the type column at the 'T' or 'I', whichever comes first
df['coupon'] = df['type'].str.split('T|I', expand=True)[0]


# There are some special fraction characters that we should replace
df['coupon'] = df['coupon'].str.replace('¾', ' 3/4').str.replace('¼', ' 1/4').str.replace('½', ' 1/2')

def parse_coupon(coupon):
    # A function to parse the coupon string into a float by decimalising the fractions and summing the parts
    coupon = str(coupon)
    coupon = coupon.replace("%", "")
    coupon = coupon.split("Green Gilt")[0] # Remove the Green Gilt part if it exists
    # The coupon has multiple parts - always an integer then sometimes a fraction
    parts = coupon.split(" ")
    decimalised_parts = []
    for part in parts:
        if "/" in part:
            fraction_parts = part.split("/")
            decimalised = (pd.to_numeric(fraction_parts[0], errors='coerce') / pd.to_numeric(fraction_parts[1], errors='coerce'))
        else:
            decimalised = pd.to_numeric(part, errors='coerce')
        if not pd.isna(decimalised):
            decimalised_parts.append(decimalised) 
    return sum(decimalised_parts)

df['coupon'] = df['coupon'].apply(parse_coupon)
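
To make the decimalisation concrete, here is a standalone sketch of the same idea using the standard library's `fractions.Fraction`. The helper name `coupon_to_float` and the `VULGAR` mapping are hypothetical and are not part of the notebook's `parse_coupon`.

```python
from fractions import Fraction

# Hypothetical mapping of single-character fractions to their spelled-out form.
VULGAR = {"¼": " 1/4", "½": " 1/2", "¾": " 3/4"}

def coupon_to_float(text):
    """Sum the whole-number and fractional parts of a coupon string."""
    for symbol, spelled in VULGAR.items():
        text = text.replace(symbol, spelled)
    total = Fraction(0)
    for part in text.replace("%", "").split():
        try:
            total += Fraction(part)  # accepts both "4" and "1/8"
        except ValueError:
            continue  # skip tokens that are not numbers
    return float(total)

print(coupon_to_float("0 1/8%"))  # 0.125
print(coupon_to_float("4¾%"))     # 4.75
print(coupon_to_float("2½%"))     # 2.5
```

For example, "4 1/8%" is read as 4 + 1/8 = 4.125, which matches the coupon shown for the 2030 index-linked stock later in the notebook.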


<br><br>
## Step 5: Converting string fields to numeric and standardising dates
We convert the `issueAmount`, `baseRPI`, and `amountINCinfUplift` fields to numeric, coercing any invalid entries to `NaN`.
We also extract the issue year from `issueDate` and drop rows that lack a `type`, `code`, or `redemptionDate`.

In [125]:


# Convert relevant columns to numeric (handling NaN values)
df['issueAmount'] = pd.to_numeric(df['issueAmount'].astype(str).str.replace(',', ''), errors='coerce')
df['baseRPI'] = pd.to_numeric(df['baseRPI'].astype(str).str.replace(',', ''), errors='coerce')
df['amountINCinfUplift'] = pd.to_numeric(df['amountINCinfUplift'].astype(str).str.replace(',', ''), errors='coerce')
df.head()

# Extract issue year and convert to numeric (handling NaN values)
df['issueYear'] = pd.to_numeric(df['issueDate'].astype(str).str[-4:], errors='coerce')
df.head()

# Drop rows with missing values in type, code, and redemptionDate
df = df.dropna(subset=["type", "code", "redemptionDate"])
df.head()

df.sample(5)

Unnamed: 0,type,code,redemptionDate,issueDate,divDates,divNextDate,issueAmount,baseRPI,amountINCinfUplift,giltType,coupon,issueYear
64,1 7/8% Index-linked Treasury Gilt 2022,GB00B1Z5HQ14,22-Nov-2022,11-Jul-2007,22 May/Nov,11-Nov-2022,15743.0,205.65806,26554.0,Index-linked Treasury Gilt 2022,1.875,2007.0
25,4¾% Treasury Gilt 2030,GB00B24FF097,07-Dec-2030,03-Oct-2007,7 Jun/Dec,28-Nov-2022,42819.0,,,Treasury Gilt 2030,4.75,2007.0
78,0 5/8% Index-linked Treasury Gilt 2042,GB00B3MYD345,22-Nov-2042,24-Jul-2009,22 May/Nov,11-Nov-2022,12559.0,212.46452,20364.0,Index-linked Treasury Gilt 2042,0.625,2009.0
43,0 7/8% Treasury Gilt 2046,GB00BNNGP775,31-Jan-2046,20-Jan-2021,31 Jan/Jul,20-Jan-2023,20733.0,,,Treasury Gilt 2046,0.875,2021.0
48,0 5/8% Treasury Gilt 2050,GB00BMBL1F74,22-Oct-2050,10-Jun-2020,22 Apr/Oct,13-Oct-2022,31916.0,,,Treasury Gilt 2050,0.625,2020.0
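
The cell above takes the year from the last four characters of the date string. If full dates are needed, one option is to parse them with `pd.to_datetime`. The sketch below assumes the `DD-Mon-YYYY` layout seen in the tables and English month abbreviations. The toy values, including the "n/a" entry and the `issueDateParsed` column name, are illustrative.

```python
import pandas as pd

demo = pd.DataFrame({
    "issueDate": ["08-Apr-2020", "20-Jul-2017", None],
    "issueAmount": ["33,824", "34359", "n/a"],
})

# Parse dates with an explicit format; unparseable or missing values become NaT.
demo["issueDateParsed"] = pd.to_datetime(
    demo["issueDate"], format="%d-%b-%Y", errors="coerce"
)
demo["issueYear"] = demo["issueDateParsed"].dt.year

# Strip thousands separators, then coerce anything non-numeric to NaN.
demo["issueAmount"] = pd.to_numeric(
    demo["issueAmount"].str.replace(",", "", regex=False), errors="coerce"
)
print(demo)
```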


<br><br>
## Step 6: Dropping invalid rows
We drop rows that still contain missing values in `giltType`, `code`, or `redemptionDate`, which removes any remaining rows without a recognisable gilt type.

In [126]:
# Drop rows with missing values in giltType, code, and redemptionDate
df = df.query("giltType != '<NA>'")
df = df.dropna(subset=["giltType", "code", "redemptionDate"])
df.head()

Unnamed: 0,type,code,redemptionDate,issueDate,divDates,divNextDate,issueAmount,baseRPI,amountINCinfUplift,giltType,coupon,issueYear
1,0 1/8% Treasury Gilt 2023,GB00BL68HG94,31-Jan-2023,08-Apr-2020,31 Jan/Jul,20-Jan-2023,33824.0,,,Treasury Gilt 2023,0.125,2020.0
2,0¾% Treasury Gilt 2023,GB00BF0HZ991,22-Jul-2023,20-Jul-2017,22 Jan/Jul,12-Jan-2023,34359.0,,,Treasury Gilt 2023,0.75,2017.0
3,2¼% Treasury Gilt 2023,GB00B7Z53659,07-Sep-2023,12-Jun-2013,7 Mar/Sep,24-Feb-2023,35922.0,,,Treasury Gilt 2023,2.25,2013.0
4,0 1/8% Treasury Gilt 2024,GB00BMGR2791,31-Jan-2024,07-Oct-2020,31 Jan/Jul,20-Jan-2023,35551.0,,,Treasury Gilt 2024,0.125,2020.0
5,1% Treasury Gilt 2024,GB00BFWFPL34,22-Apr-2024,25-Jul-2018,22 Apr/Oct,13-Oct-2022,35638.0,,,Treasury Gilt 2024,1.0,2018.0


<br><br>
## Step 7: Sorting and calculating cumulative totals
We sort the data by `issueYear` and calculate the running total of `issueAmount` (in £ million nominal).

In [127]:
# Sort by issueYear and calculate cumulative total
df = df.sort_values('issueYear')
df['totalCumulative'] = df['issueAmount'].cumsum()
df.head()

Unnamed: 0,type,code,redemptionDate,issueDate,divDates,divNextDate,issueAmount,baseRPI,amountINCinfUplift,giltType,coupon,issueYear,totalCumulative
94,2½% Index-linked Treasury Stock 2024,GB0008983024,17-Jul-2024,30-Dec-1986,17 Jan/Jul,06-Jan-2023,6821.0,97.667934,22363.0,Index-linked Treasury Stock 2024,2.5,1986.0,6821.0
95,4 1/8% Index-linked Treasury Stock 2030,GB0008932666,22-Jul-2030,12-Jun-1992,22 Jan/Jul,12-Jan-2023,4841.0,135.1,11474.0,Index-linked Treasury Stock 2030,4.125,1992.0,11662.0
20,6% Treasury Stock 2028,GB0002404191,07-Dec-2028,29-Jan-1998,7 Jun/Dec,28-Nov-2022,20256.0,,,Treasury Stock 2028,6.0,1998.0,31918.0
28,4¼% Treasury Stock 2032,GB0004893086,07-Jun-2032,25-May-2000,7 Jun/Dec,28-Nov-2022,40331.0,,,Treasury Stock 2032,4.25,2000.0,72249.0
8,5% Treasury Stock 2025,GB0030880693,07-Mar-2025,27-Sep-2001,7 Mar/Sep,24-Feb-2023,37339.0,,,Treasury Stock 2025,5.0,2001.0,109588.0
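
As a worked example of the running total, the sketch below uses the three earliest issue amounts from the table above in shuffled order. `kind="stable"` is an optional choice, not used in the cell above, that keeps rows with the same `issueYear` in their original relative order.

```python
import pandas as pd

demo = pd.DataFrame({
    "issueYear": [1998, 1986, 1992],
    "issueAmount": [20256.0, 6821.0, 4841.0],
})

# Sort chronologically, then accumulate: 6821, 6821 + 4841, 6821 + 4841 + 20256.
demo = demo.sort_values("issueYear", kind="stable")
demo["totalCumulative"] = demo["issueAmount"].cumsum()
print(demo)
```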


<br><br>
## Step 8: Exporting the cleaned data
Finally, we export the cleaned dataset to a CSV file, then keep only the columns of interest for display.

In [128]:
# Export the cleaned data, with all columns, to CSV
df.to_csv("GiltsInIssueClean.csv", index=False)

# Keep just the columns we want
df = df[['giltType', 'code', 'issueYear', 'issueAmount', 'totalCumulative', 'coupon']]

# Display the cleaned DataFrame
df

Unnamed: 0,giltType,code,issueYear,issueAmount,totalCumulative,coupon
94,Index-linked Treasury Stock 2024,GB0008983024,1986.0,6821.0,6821.0,2.500
95,Index-linked Treasury Stock 2030,GB0008932666,1992.0,4841.0,11662.0,4.125
20,Treasury Stock 2028,GB0002404191,1998.0,20256.0,31918.0,6.000
28,Treasury Stock 2032,GB0004893086,2000.0,40331.0,72249.0,4.250
8,Treasury Stock 2025,GB0030880693,2001.0,37339.0,109588.0,5.000
...,...,...,...,...,...,...
49,Treasury Gilt 2051,GB00BLH38158,2021.0,26049.0,1989151.0,1.250
75,Index-linked Treasury Gilt 2039,GB00BLH38265,2021.0,6303.0,1995454.0,0.125
21,Treasury Gilt 2029,GB00BLPK7227,2021.0,17544.0,2012998.0,0.500
15,Treasury Gilt 2027,GB00BL6C7720,2022.0,4445.0,2017443.0,4.125
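
Because the file is written before the column selection, the exported CSV contains every column. If only the selected columns are wanted in the file, the selection can be applied first. The sketch below shows that ordering on a toy frame and writes to an in-memory buffer rather than to disk, which is an assumption made to keep the example self-contained.

```python
import io
import pandas as pd

demo = pd.DataFrame({
    "giltType": ["Index-linked Treasury Stock 2024", "Index-linked Treasury Stock 2030"],
    "code": ["GB0008983024", "GB0008932666"],
    "divDates": ["17 Jan/Jul", "22 Jan/Jul"],
    "issueAmount": [6821.0, 4841.0],
})

# Select the columns first, then export, so the file holds only those columns.
buffer = io.StringIO()
demo[["giltType", "code", "issueAmount"]].to_csv(buffer, index=False)

# Read the CSV text back to confirm what was written.
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored.columns.tolist())
```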
