# UK Debt Analysis - DMO Data Cleaning

This notebook demonstrates how to clean and process the UK Debt Management Office (DMO) gilts-in-issue data, applying string functions, numerical transformations, and filtering to produce a clean dataset.

<br><br>

First we load the dataset, instructing `pandas` to skip the first 12 rows, which contain only metadata.

In [1]:
import pandas as pd

# Load the data and skip the first 12 rows
df = pd.read_csv("https://github.com/RDeconomist/RDeconomist.github.io/raw/refs/heads/main/data/GiltsInIssueRaw.csv", encoding='ISO-8859-1', skiprows=12)
df.head()

Unnamed: 0,2¼% Treasury Gilt 2023,GB00B7Z53659,07-Sep-2023,12-Jun-2013,7 Mar/Sep,24-Feb-2023,"35,922",Unnamed: 7,Unnamed: 8
0,0 1/8% Treasury Gilt 2024,GB00BMGR2791,31-Jan-2024,07-Oct-2020,31 Jan/Jul,20-Jan-2023,35551,,
1,1% Treasury Gilt 2024,GB00BFWFPL34,22-Apr-2024,25-Jul-2018,22 Apr/Oct,13-Oct-2022,35638,,
2,2¾% Treasury Gilt 2024,GB00BHBFH458,07-Sep-2024,12-Mar-2014,7 Mar/Sep,24-Feb-2023,35806,,
3,0¼% Treasury Gilt 2025,GB00BLPK7110,31-Jan-2025,02-Jul-2021,31 Jan/Jul,20-Jan-2023,29353,,
4,5% Treasury Stock 2025,GB0030880693,07-Mar-2025,27-Sep-2001,7 Mar/Sep,24-Feb-2023,37339,,
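
The header row of the output above is itself a gilt (`2¼% Treasury Gilt 2023`), which suggests that `read_csv` took the first unskipped line as column names, its default behaviour. The sketch below illustrates the difference that `header=None` makes. It uses a small made-up inline sample rather than the DMO file, so `sample` and its two metadata lines are assumptions for illustration only.

```python
import io
import pandas as pd

# Hypothetical two-line metadata preamble followed by two data rows.
sample = (
    "Some metadata line\n"
    "Another metadata line\n"
    "1% Treasury Gilt 2024,GB00BFWFPL34\n"
    "5% Treasury Stock 2025,GB0030880693\n"
)

# Default behaviour: the first unskipped line becomes the header.
with_default = pd.read_csv(io.StringIO(sample), skiprows=2)

# header=None keeps every unskipped line as data.
with_no_header = pd.read_csv(io.StringIO(sample), skiprows=2, header=None)

print(len(with_default), len(with_no_header))
```

Whether the first gilt should be kept as data depends on the analysis; if it should, passing `header=None` before renaming the columns is one way to retain it.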


<br><br>
## Step 1: Renaming columns 
We rename the columns to reflect the data they contain.

In [2]:
# Rename columns
df.columns = ['type', 'code', 'redemptionDate', 'issueDate', 'divDates', 'divNextDate', 'issueAmount', 'baseRPI', 'amountINCinfUplift']
df.head()

Unnamed: 0,type,code,redemptionDate,issueDate,divDates,divNextDate,issueAmount,baseRPI,amountINCinfUplift
0,0 1/8% Treasury Gilt 2024,GB00BMGR2791,31-Jan-2024,07-Oct-2020,31 Jan/Jul,20-Jan-2023,35551,,
1,1% Treasury Gilt 2024,GB00BFWFPL34,22-Apr-2024,25-Jul-2018,22 Apr/Oct,13-Oct-2022,35638,,
2,2¾% Treasury Gilt 2024,GB00BHBFH458,07-Sep-2024,12-Mar-2014,7 Mar/Sep,24-Feb-2023,35806,,
3,0¼% Treasury Gilt 2025,GB00BLPK7110,31-Jan-2025,02-Jul-2021,31 Jan/Jul,20-Jan-2023,29353,,
4,5% Treasury Stock 2025,GB0030880693,07-Mar-2025,27-Sep-2001,7 Mar/Sep,24-Feb-2023,37339,,


<br><br>
## Step 2: Dropping unnecessary rows
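We drop rows where `type`, `code`, or `redemptionDate` is an empty string, along with repeated header rows whose `code` is "ISIN Code". Rows holding true missing values (`NaN`) are handled later, in Step 8.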

In [3]:
# Drop rows with an empty type, code, or redemptionDate, plus repeated "ISIN Code" header rows
df = df[(df['type'] != "") & (df['code'] != "ISIN Code") & (df['code'] != "") & (df['redemptionDate'] != "")]
df.head()

Unnamed: 0,type,code,redemptionDate,issueDate,divDates,divNextDate,issueAmount,baseRPI,amountINCinfUplift
0,0 1/8% Treasury Gilt 2024,GB00BMGR2791,31-Jan-2024,07-Oct-2020,31 Jan/Jul,20-Jan-2023,35551,,
1,1% Treasury Gilt 2024,GB00BFWFPL34,22-Apr-2024,25-Jul-2018,22 Apr/Oct,13-Oct-2022,35638,,
2,2¾% Treasury Gilt 2024,GB00BHBFH458,07-Sep-2024,12-Mar-2014,7 Mar/Sep,24-Feb-2023,35806,,
3,0¼% Treasury Gilt 2025,GB00BLPK7110,31-Jan-2025,02-Jul-2021,31 Jan/Jul,20-Jan-2023,29353,,
4,5% Treasury Stock 2025,GB0030880693,07-Mar-2025,27-Sep-2001,7 Mar/Sep,24-Feb-2023,37339,,
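
A comparison against `""` does not remove `NaN` values, because `NaN != ""` evaluates to `True`. The sketch below shows this on a made-up four-row frame (the rows in `demo` are illustrative, not taken from the DMO file) and adds a `notna()` check as one possible way to catch both cases in a single pass.

```python
import numpy as np
import pandas as pd

# Made-up rows: one valid, one empty string, one NaN, one repeated header.
demo = pd.DataFrame({
    "type": ["1% Treasury Gilt 2024", "", np.nan, "Conventional Gilts"],
    "code": ["GB00BFWFPL34", "", np.nan, "ISIN Code"],
})

# Comparing against "" keeps the NaN row.
kept_by_comparison = demo[
    (demo["type"] != "") & (demo["code"] != "ISIN Code") & (demo["code"] != "")
]

# Adding notna() removes the NaN row as well.
kept_strict = kept_by_comparison[kept_by_comparison["code"].notna()]

print(len(kept_by_comparison), len(kept_strict))
```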


<br><br>
## Step 3: Splitting the `type` column
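We split the `type` column at its first "T" (the start of "Treasury") so that the coupon text before it lands in `rawType`; a stripped copy of `rawType` becomes the initial `giltType`.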

In [4]:
# Split 'type' at the first "T"; the delimiter is dropped, so 'extra' starts with "reasury"
df[['rawType', 'extra']] = df['type'].str.split("T", n=1, expand=True)
df['giltType'] = df['rawType'].str.strip()
df.head()

Unnamed: 0,type,code,redemptionDate,issueDate,divDates,divNextDate,issueAmount,baseRPI,amountINCinfUplift,rawType,extra,giltType
0,0 1/8% Treasury Gilt 2024,GB00BMGR2791,31-Jan-2024,07-Oct-2020,31 Jan/Jul,20-Jan-2023,35551,,,0 1/8%,reasury Gilt 2024,0 1/8%
1,1% Treasury Gilt 2024,GB00BFWFPL34,22-Apr-2024,25-Jul-2018,22 Apr/Oct,13-Oct-2022,35638,,,1%,reasury Gilt 2024,1%
2,2¾% Treasury Gilt 2024,GB00BHBFH458,07-Sep-2024,12-Mar-2014,7 Mar/Sep,24-Feb-2023,35806,,,2¾%,reasury Gilt 2024,2¾%
3,0¼% Treasury Gilt 2025,GB00BLPK7110,31-Jan-2025,02-Jul-2021,31 Jan/Jul,20-Jan-2023,29353,,,0¼%,reasury Gilt 2025,0¼%
4,5% Treasury Stock 2025,GB0030880693,07-Mar-2025,27-Sep-2001,7 Mar/Sep,24-Feb-2023,37339,,,5%,reasury Stock 2025,5%
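
Splitting on "T" discards the delimiter, which is why `extra` reads "reasury Gilt 2024" in the output above. As a sketch of an alternative (not the method this notebook uses), a regular expression can split at the first `%` instead and keep the gilt name intact. The column names `couponText` and `giltName` are hypothetical.

```python
import pandas as pd

names = pd.Series([
    "0 1/8% Treasury Gilt 2024",
    "2½% Index-linked Treasury Stock 2024",
    "1½% Green Gilt 2053",
])

# Everything up to the first "%" is the coupon text; the rest is the gilt name.
parts = names.str.extract(r"^(?P<couponText>.*?%)\s*(?P<giltName>.*)$")
print(parts)
```

Note that this places "Index-linked" in the name rather than in the coupon text, so it is not a drop-in replacement for `rawType`.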


<br><br>
## Step 4: Processing the `giltType` 
We replace empty `giltType` values with "Green Gilt" and label every gilt that has a `baseRPI` value as "Index Linked".

In [5]:
# Label empty gilt types as "Green Gilt", then mark rows with a base RPI as "Index Linked"
df['giltType'] = df['giltType'].replace("", "Green Gilt")
df['giltType'] = df.apply(lambda x: "Index Linked" if pd.notna(x['baseRPI']) else x['giltType'], axis=1)
df.head()

Unnamed: 0,type,code,redemptionDate,issueDate,divDates,divNextDate,issueAmount,baseRPI,amountINCinfUplift,rawType,extra,giltType
0,0 1/8% Treasury Gilt 2024,GB00BMGR2791,31-Jan-2024,07-Oct-2020,31 Jan/Jul,20-Jan-2023,35551,,,0 1/8%,reasury Gilt 2024,0 1/8%
1,1% Treasury Gilt 2024,GB00BFWFPL34,22-Apr-2024,25-Jul-2018,22 Apr/Oct,13-Oct-2022,35638,,,1%,reasury Gilt 2024,1%
2,2¾% Treasury Gilt 2024,GB00BHBFH458,07-Sep-2024,12-Mar-2014,7 Mar/Sep,24-Feb-2023,35806,,,2¾%,reasury Gilt 2024,2¾%
3,0¼% Treasury Gilt 2025,GB00BLPK7110,31-Jan-2025,02-Jul-2021,31 Jan/Jul,20-Jan-2023,29353,,,0¼%,reasury Gilt 2025,0¼%
4,5% Treasury Stock 2025,GB0030880693,07-Mar-2025,27-Sep-2001,7 Mar/Sep,24-Feb-2023,37339,,,5%,reasury Stock 2025,5%
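
The row-wise `apply` above can also be written without a lambda. The following is a minimal sketch using `Series.mask`, run on a made-up three-row frame; it is intended to behave the same way as the cell above, under the assumption that `baseRPI` is `NaN` for conventional gilts.

```python
import numpy as np
import pandas as pd

# Illustrative rows only: a conventional gilt, an empty label, an index-linked gilt.
demo = pd.DataFrame({
    "giltType": ["5%", "", "2½% Index-linked"],
    "baseRPI": [np.nan, np.nan, 97.667934],
})

# Replace empty labels, then overwrite wherever baseRPI is present.
demo["giltType"] = demo["giltType"].replace("", "Green Gilt")
demo["giltType"] = demo["giltType"].mask(demo["baseRPI"].notna(), "Index Linked")

print(demo["giltType"].tolist())
```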


<br><br>
## Step 5: Extracting Coupon Information using String Functions

In this step, we extract coupon information from the `rawType` column, which can appear in two formats: fractions (e.g., "5/8%") or decimals (e.g., "2.5%"). We rely on the string methods that `pandas` exposes through the `.str` accessor.

1. **Identifying fractions**: We look for entries containing a `/` and extract the fraction (e.g., `5/8`). This is done by splitting the string at the `%` symbol, keeping the part before it, and then taking its last whitespace-separated token. For "0 1/8%" this yields `1/8`, so any whole-number part is not included.

2. **Identifying decimals**: Similarly, we look for entries containing a `.` to identify decimals (e.g., `2.5`). We extract the decimal value in the same way, by splitting at the `%` symbol and then taking the last token.

3. **Cleaning the extracted coupons**: Finally, we remove any leading or trailing spaces to ensure the coupon values are clean.

Coupons written as whole numbers or with Unicode fraction characters (e.g., "5%" or "2¾%") match neither pattern, so their `coupon` value stays empty.

In [6]:
# Create an empty 'coupon' column
df['coupon'] = None

# Step 1: Extract fraction-based coupons (like "5/8%")
mask_fraction = df['rawType'].str.contains('/')
df.loc[mask_fraction, 'coupon'] = df.loc[mask_fraction, 'rawType'].str.split('%').str[0].str.split().str[-1]

# Step 2: Extract decimal-based coupons (like "2.5%")
mask_decimal = df['rawType'].str.contains(r'\.')  # raw string avoids an invalid-escape warning
df.loc[mask_decimal, 'coupon'] = df.loc[mask_decimal, 'rawType'].str.split('%').str[0].str.split().str[-1]

# Step 3: Clean up the extracted coupons (remove any spaces)
df['coupon'] = df['coupon'].str.strip()

# Optionally display the resulting DataFrame with the extracted coupons
df[['rawType', 'coupon']].head()

Unnamed: 0,type,code,redemptionDate,issueDate,divDates,divNextDate,issueAmount,baseRPI,amountINCinfUplift,rawType,extra,giltType,coupon
0,0 1/8% Treasury Gilt 2024,GB00BMGR2791,31-Jan-2024,07-Oct-2020,31 Jan/Jul,20-Jan-2023,35551,,,0 1/8%,reasury Gilt 2024,0 1/8%,1/8
1,1% Treasury Gilt 2024,GB00BFWFPL34,22-Apr-2024,25-Jul-2018,22 Apr/Oct,13-Oct-2022,35638,,,1%,reasury Gilt 2024,1%,
2,2¾% Treasury Gilt 2024,GB00BHBFH458,07-Sep-2024,12-Mar-2014,7 Mar/Sep,24-Feb-2023,35806,,,2¾%,reasury Gilt 2024,2¾%,
3,0¼% Treasury Gilt 2025,GB00BLPK7110,31-Jan-2025,02-Jul-2021,31 Jan/Jul,20-Jan-2023,29353,,,0¼%,reasury Gilt 2025,0¼%,
4,5% Treasury Stock 2025,GB0030880693,07-Mar-2025,27-Sep-2001,7 Mar/Sep,24-Feb-2023,37339,,,5%,reasury Stock 2025,5%,
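
The `coupon` column above holds text such as `1/8` and is empty for entries like "5%" or "2¾%". If a numeric coupon is needed, one possible approach is sketched below using only the standard library. The helper `coupon_to_float` is hypothetical and not part of the notebook; it assumes the text before `%` consists of digits, an optional ASCII fraction, and optional Unicode fraction characters.

```python
import unicodedata
from fractions import Fraction

def coupon_to_float(text):
    """Convert coupon text such as '0 1/8%', '2¾%', '5%' or '2.5%' to a float."""
    body = text.split("%")[0].strip()
    total = 0.0
    for token in body.split():
        if "/" in token:
            # ASCII fraction such as "1/8"
            total += float(Fraction(token))
            continue
        digits = ""
        for ch in token:
            if ch in "0123456789.":
                digits += ch  # whole-number or decimal part
            else:
                # Unicode vulgar fraction such as "¾" (raises ValueError otherwise)
                total += unicodedata.numeric(ch)
        if digits:
            total += float(digits)
    return total

for example in ["0 1/8% Treasury Gilt 2024", "2¾%", "5%", "4 1/8% Index-linked"]:
    print(example, coupon_to_float(example))
```

Applied to the notebook's data, this could be used as `df['type'].map(coupon_to_float)`, provided every `type` value follows the assumed format.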


<br><br>
## Step 6: Converting string fields to numeric
We convert the `issueAmount`, `baseRPI`, and `amountINCinfUplift` fields to numeric, removing thousands separators and coercing any invalid entries to `NaN`.

In [7]:
# Convert relevant columns to numeric (handling NaN values)
df['issueAmount'] = pd.to_numeric(df['issueAmount'].str.replace(',', ''), errors='coerce')
df['baseRPI'] = pd.to_numeric(df['baseRPI'].str.replace(',', ''), errors='coerce')
df['amountINCinfUplift'] = pd.to_numeric(df['amountINCinfUplift'].str.replace(',', ''), errors='coerce')
df.head()

Unnamed: 0,type,code,redemptionDate,issueDate,divDates,divNextDate,issueAmount,baseRPI,amountINCinfUplift,rawType,extra,giltType,coupon
0,0 1/8% Treasury Gilt 2024,GB00BMGR2791,31-Jan-2024,07-Oct-2020,31 Jan/Jul,20-Jan-2023,35551.0,,,0 1/8%,reasury Gilt 2024,0 1/8%,1/8
1,1% Treasury Gilt 2024,GB00BFWFPL34,22-Apr-2024,25-Jul-2018,22 Apr/Oct,13-Oct-2022,35638.0,,,1%,reasury Gilt 2024,1%,
2,2¾% Treasury Gilt 2024,GB00BHBFH458,07-Sep-2024,12-Mar-2014,7 Mar/Sep,24-Feb-2023,35806.0,,,2¾%,reasury Gilt 2024,2¾%,
3,0¼% Treasury Gilt 2025,GB00BLPK7110,31-Jan-2025,02-Jul-2021,31 Jan/Jul,20-Jan-2023,29353.0,,,0¼%,reasury Gilt 2025,0¼%,
4,5% Treasury Stock 2025,GB0030880693,07-Mar-2025,27-Sep-2001,7 Mar/Sep,24-Feb-2023,37339.0,,,5%,reasury Stock 2025,5%,
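
To make the effect of `errors='coerce'` concrete, here is a minimal sketch on a short made-up series. The value "35,922" is the comma-formatted amount seen in the raw header row; "n/a" is an invented invalid entry.

```python
import pandas as pd

raw = pd.Series(["35,922", "35551", "n/a", None])

# Strip thousands separators, then coerce anything unparseable to NaN.
clean = pd.to_numeric(raw.str.replace(",", "", regex=False), errors="coerce")
print(clean.tolist())
```

Both the invalid string and the missing value end up as `NaN`, so later steps cannot tell them apart.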


<br><br>
## Step 7: Extracting and processing the issue year 
We take the last four characters of `issueDate`, which hold the year, and convert them to numeric.

In [8]:
# Extract issue year and convert to numeric (handling NaN values)
df['issueYear'] = pd.to_numeric(df['issueDate'].str[-4:], errors='coerce')
df.head()

Unnamed: 0,type,code,redemptionDate,issueDate,divDates,divNextDate,issueAmount,baseRPI,amountINCinfUplift,rawType,extra,giltType,coupon,issueYear
0,0 1/8% Treasury Gilt 2024,GB00BMGR2791,31-Jan-2024,07-Oct-2020,31 Jan/Jul,20-Jan-2023,35551.0,,,0 1/8%,reasury Gilt 2024,0 1/8%,1/8,2020.0
1,1% Treasury Gilt 2024,GB00BFWFPL34,22-Apr-2024,25-Jul-2018,22 Apr/Oct,13-Oct-2022,35638.0,,,1%,reasury Gilt 2024,1%,,2018.0
2,2¾% Treasury Gilt 2024,GB00BHBFH458,07-Sep-2024,12-Mar-2014,7 Mar/Sep,24-Feb-2023,35806.0,,,2¾%,reasury Gilt 2024,2¾%,,2014.0
3,0¼% Treasury Gilt 2025,GB00BLPK7110,31-Jan-2025,02-Jul-2021,31 Jan/Jul,20-Jan-2023,29353.0,,,0¼%,reasury Gilt 2025,0¼%,,2021.0
4,5% Treasury Stock 2025,GB0030880693,07-Mar-2025,27-Sep-2001,7 Mar/Sep,24-Feb-2023,37339.0,,,5%,reasury Stock 2025,5%,,2001.0
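
Slicing the last four characters works because every date follows the `DD-Mon-YYYY` pattern. As an alternative sketch, the dates can be parsed with `pd.to_datetime`, which also validates the day and month. The format string below is an assumption based on the dates shown in the output, and "not a date" is an invented invalid entry.

```python
import pandas as pd

dates = pd.Series(["07-Oct-2020", "27-Sep-2001", "not a date"])

# Parse day-month-year strings; entries that do not match become NaT.
parsed = pd.to_datetime(dates, format="%d-%b-%Y", errors="coerce")
years = parsed.dt.year
print(years.tolist())
```

Month abbreviations are matched using the active locale, so this assumes an English locale.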


<br><br>
## Step 8: Dropping invalid rows
We drop rows that still contain missing values in critical columns such as `type`, `code`, and `redemptionDate`.

In [9]:
# Drop rows with missing values in type, code, and redemptionDate
df = df.dropna(subset=["type", "code", "redemptionDate"])
df.head()

Unnamed: 0,type,code,redemptionDate,issueDate,divDates,divNextDate,issueAmount,baseRPI,amountINCinfUplift,rawType,extra,giltType,coupon,issueYear
0,0 1/8% Treasury Gilt 2024,GB00BMGR2791,31-Jan-2024,07-Oct-2020,31 Jan/Jul,20-Jan-2023,35551.0,,,0 1/8%,reasury Gilt 2024,0 1/8%,1/8,2020.0
1,1% Treasury Gilt 2024,GB00BFWFPL34,22-Apr-2024,25-Jul-2018,22 Apr/Oct,13-Oct-2022,35638.0,,,1%,reasury Gilt 2024,1%,,2018.0
2,2¾% Treasury Gilt 2024,GB00BHBFH458,07-Sep-2024,12-Mar-2014,7 Mar/Sep,24-Feb-2023,35806.0,,,2¾%,reasury Gilt 2024,2¾%,,2014.0
3,0¼% Treasury Gilt 2025,GB00BLPK7110,31-Jan-2025,02-Jul-2021,31 Jan/Jul,20-Jan-2023,29353.0,,,0¼%,reasury Gilt 2025,0¼%,,2021.0
4,5% Treasury Stock 2025,GB0030880693,07-Mar-2025,27-Sep-2001,7 Mar/Sep,24-Feb-2023,37339.0,,,5%,reasury Stock 2025,5%,,2001.0


<br><br>
## Step 9: Sorting and calculating cumulative totals
We sort the data by `issueYear` and calculate the running total of `issueAmount` in that order.

In [10]:
# Sort by issueYear and calculate cumulative total
df = df.sort_values('issueYear')
df['totalCumulative'] = df['issueAmount'].cumsum()
df.head()

Unnamed: 0,type,code,redemptionDate,issueDate,divDates,divNextDate,issueAmount,baseRPI,amountINCinfUplift,rawType,extra,giltType,coupon,issueYear,totalCumulative
90,2½% Index-linked Treasury Stock 2024,GB0008983024,17-Jul-2024,30-Dec-1986,17 Jan/Jul,06-Jan-2023,6821.0,97.667934,22363.0,2½% Index-linked,reasury Stock 2024,Index Linked,,1986.0,6821.0
91,4 1/8% Index-linked Treasury Stock 2030,GB0008932666,22-Jul-2030,12-Jun-1992,22 Jan/Jul,12-Jan-2023,4841.0,135.1,11474.0,4 1/8% Index-linked,reasury Stock 2030,Index Linked,1/8,1992.0,11662.0
16,6% Treasury Stock 2028,GB0002404191,07-Dec-2028,29-Jan-1998,7 Jun/Dec,28-Nov-2022,20256.0,,,6%,reasury Stock 2028,6%,,1998.0,31918.0
24,4¼% Treasury Stock 2032,GB0004893086,07-Jun-2032,25-May-2000,7 Jun/Dec,28-Nov-2022,40331.0,,,4¼%,reasury Stock 2032,4¼%,,2000.0,72249.0
4,5% Treasury Stock 2025,GB0030880693,07-Mar-2025,27-Sep-2001,7 Mar/Sep,24-Feb-2023,37339.0,,,5%,reasury Stock 2025,5%,,2001.0,109588.0
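
The default sort algorithm in `sort_values` is not guaranteed to be stable, so rows that share an issue year may not keep their original relative order, and the running total within that year depends on the order they land in. The sketch below passes `kind="stable"` to make the order reproducible. The frame is made up; the amount `100.0` in particular is invented to create a tie on 2001.

```python
import pandas as pd

demo = pd.DataFrame({
    "issueYear": [2001, 1986, 2001, 1992],
    "issueAmount": [37339.0, 6821.0, 100.0, 4841.0],
})

# kind="stable" keeps rows with equal issueYear in their original order.
ordered = demo.sort_values("issueYear", kind="stable")
ordered["totalCumulative"] = ordered["issueAmount"].cumsum()
print(ordered)
```

The final cumulative total is the same either way; only the intermediate values for tied years differ.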


<br><br>
## Step 10: Assigning unique IDs 
We assign a unique ID to each bond in the dataset.

In [11]:
# Assign unique IDs
df['id'] = range(1, len(df) + 1)
df.head()

Unnamed: 0,type,code,redemptionDate,issueDate,divDates,divNextDate,issueAmount,baseRPI,amountINCinfUplift,rawType,extra,giltType,coupon,issueYear,totalCumulative,id
90,2½% Index-linked Treasury Stock 2024,GB0008983024,17-Jul-2024,30-Dec-1986,17 Jan/Jul,06-Jan-2023,6821.0,97.667934,22363.0,2½% Index-linked,reasury Stock 2024,Index Linked,,1986.0,6821.0,1
91,4 1/8% Index-linked Treasury Stock 2030,GB0008932666,22-Jul-2030,12-Jun-1992,22 Jan/Jul,12-Jan-2023,4841.0,135.1,11474.0,4 1/8% Index-linked,reasury Stock 2030,Index Linked,1/8,1992.0,11662.0,2
16,6% Treasury Stock 2028,GB0002404191,07-Dec-2028,29-Jan-1998,7 Jun/Dec,28-Nov-2022,20256.0,,,6%,reasury Stock 2028,6%,,1998.0,31918.0,3
24,4¼% Treasury Stock 2032,GB0004893086,07-Jun-2032,25-May-2000,7 Jun/Dec,28-Nov-2022,40331.0,,,4¼%,reasury Stock 2032,4¼%,,2000.0,72249.0,4
4,5% Treasury Stock 2025,GB0030880693,07-Mar-2025,27-Sep-2001,7 Mar/Sep,24-Feb-2023,37339.0,,,5%,reasury Stock 2025,5%,,2001.0,109588.0,5


<br><br>
## Step 11: Exporting the cleaned data 
Finally, we export the full cleaned dataset to a CSV file, then keep only the columns needed for display.

In [12]:
# Export cleaned data to CSV
df.to_csv("GiltsInIssueClean.csv", index=False)

# Keep just the columns we want
df = df[['id', 'giltType', 'code', 'issueYear', 'issueAmount', 'totalCumulative', 'coupon']]

# Display the cleaned DataFrame
df

Unnamed: 0,type,code,redemptionDate,issueDate,divDates,divNextDate,issueAmount,baseRPI,amountINCinfUplift,rawType,extra,giltType,coupon,issueYear,totalCumulative,id
90,2½% Index-linked Treasury Stock 2024,GB0008983024,17-Jul-2024,30-Dec-1986,17 Jan/Jul,06-Jan-2023,6821.0,97.667934,22363.0,2½% Index-linked,reasury Stock 2024,Index Linked,,1986.0,6821.0,1
91,4 1/8% Index-linked Treasury Stock 2030,GB0008932666,22-Jul-2030,12-Jun-1992,22 Jan/Jul,12-Jan-2023,4841.0,135.100000,11474.0,4 1/8% Index-linked,reasury Stock 2030,Index Linked,1/8,1992.0,11662.0,2
16,6% Treasury Stock 2028,GB0002404191,07-Dec-2028,29-Jan-1998,7 Jun/Dec,28-Nov-2022,20256.0,,,6%,reasury Stock 2028,6%,,1998.0,31918.0,3
24,4¼% Treasury Stock 2032,GB0004893086,07-Jun-2032,25-May-2000,7 Jun/Dec,28-Nov-2022,40331.0,,,4¼%,reasury Stock 2032,4¼%,,2000.0,72249.0,4
4,5% Treasury Stock 2025,GB0030880693,07-Mar-2025,27-Sep-2001,7 Mar/Sep,24-Feb-2023,37339.0,,,5%,reasury Stock 2025,5%,,2001.0,109588.0,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
47,1½% Green Gilt 2053,GB00BM8Z2V59,31-Jul-2053,22-Oct-2021,31 Jan/Jul,20-Jan-2023,10854.0,,,1½% Green Gilt 2053,,1½% Green Gilt 2053,,2021.0,1916623.0,82
71,0 1/8% Index-linked Treasury Gilt 2039,GB00BLH38265,22-Mar-2039,26-May-2021,22 Mar/Sep,13-Mar-2023,6303.0,296.725810,7317.0,0 1/8% Index-linked,reasury Gilt 2039,Index Linked,1/8,2021.0,1922926.0,83
66,0 1/8% Index-linked Treasury Gilt 2031,GB00BNNGP551,10-Aug-2031,28-Jan-2021,10 Feb/Aug,01-Feb-2023,9563.0,293.603230,11221.0,0 1/8% Index-linked,reasury Gilt 2031,Index Linked,1/8,2021.0,1932489.0,84
11,4 1/8% Treasury Gilt 2027,GB00BL6C7720,29-Jan-2027,13-Oct-2022,29 Jan/Jul,19-Jan-2023,4445.0,,,4 1/8%,reasury Gilt 2027,4 1/8%,1/8,2022.0,1936934.0,85
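
Because the export runs before the column selection, the CSV contains every column. If only the selected columns are wanted in the file, the selection would need to come first. A minimal sketch of that ordering follows, writing to an in-memory buffer instead of disk; the two-row frame is illustrative.

```python
import io
import pandas as pd

demo = pd.DataFrame({
    "id": [1, 2],
    "code": ["GB0008983024", "GB0008932666"],
    "issueAmount": [6821.0, 4841.0],
})

# Select the columns first, then write without the index column.
buffer = io.StringIO()
demo[["id", "code"]].to_csv(buffer, index=False)

# Read the text back to confirm only the selected columns were written.
buffer.seek(0)
round_trip = pd.read_csv(buffer)
print(round_trip.columns.tolist())
```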
