# Working with the DCMS museum KPIs dataset

In this notebook I’m going to pull out the key metrics from the DCMS annual performance indicators dataset and turn them into tidy tables I can use for analysis. The raw file is split across lots of sheets, so the idea is to standardise everything and get it into a clean, consistent format.

By the end I’ll have three separate CSVs saved into the `footfall/data` folder:

- all visitor numbers by museum and financial year  
- all income KPIs by museum and financial year  
- the percentage of visitors who would recommend a visit  

These will give me a tidy set of datasets that I can use for the group project without having to keep going back to the raw DCMS file.


In [8]:
import pandas as pd
import numpy as np

data_path = r"..\data\DCMS_sponsored_museums_and_galleries_annual_performance_indicators_2023_24_tables.xlsx"
dcms_file = pd.ExcelFile(data_path)

# checking the sheet names
dcms_file.sheet_names


['Cover_sheet',
 'Contents',
 'Notes',
 '1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8',
 '9',
 '10',
 '11',
 '12']

## Looking at what’s inside the file

With the file open, the sheet names above show what I’m working with. The DCMS dataset spreads its indicators across numbered tables, so the next job is to pick out the ones that actually matter for the project.

For this piece of work I’m going to use:

- Table 1 for total visitors  
- Table 3 for visitors under 16  
- Table 4 for overseas visitors  
- Table 6 for the percentage of visitors who would recommend a visit  
- Tables 10, 11 and 12 for the income KPIs  

Everything else I can ignore for now.


In [9]:
# loading the sheets I actually need from the DCMS file
# each one becomes its own dataframe so I can clean them separately

# visitor numbers
table1_total = pd.read_excel(dcms_file, sheet_name="1")      # total visitors (Table 1)
table3_under16 = pd.read_excel(dcms_file, sheet_name="3")    # visitors under 16 (Table 3)
table4_overseas = pd.read_excel(dcms_file, sheet_name="4")   # overseas visitors (Table 4)

# recommendation percentages
table6_recommend = pd.read_excel(dcms_file, sheet_name="6")  # % who would recommend a visit (Table 6)

# income tables
table10_admissions = pd.read_excel(dcms_file, sheet_name="10")  # admissions income (Table 10)
table11_trading = pd.read_excel(dcms_file, sheet_name="11")     # trading income (Table 11)
table12_fundraising = pd.read_excel(dcms_file, sheet_name="12") # fundraising income (Table 12)

# checking one of the tables to make sure everything loaded ok
table1_total.head()


Unnamed: 0,"Table 1: Total annual visitor figures to the DCMS-sponsored museum and galleries, split by museum 2008/09 - 2023/24 [note 1]",Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17
0,This worksheet contains one table. Some cells ...,,,,,,,,,,,,,,,,,
1,Colour is used for emphasis in this table. The...,,,,,,,,,,,,,,,,,
2,"Some shorthand is used in this table, x = data...",,,,,,,,,,,,,,,,,
3,Name of museum or gallery,2008/09,2009/10,2010/11 [b],2011/12,2012/13,2013/14,2014/15 [b],2015/16,2016/17,2017/18,2018/19,2019/20,2020/21,2021/22,2022/23,2023/24,Notes
4,British Museum,5472056,5650388,5869396,5841658,5592814,6758935,6677990,6853540,6229028,5822515,6025471,5943006,160105,2045214,4544598,6159856,[note 2] The cell P6 is a revised figure.


### Step 1: Cleaning the visitor tables

The next step is to tidy up the three visitor tables. They all come with the same layout: a few lines of notes at the top, a header row buried further down, and a “Notes” column on the end that I don’t need. Once I’ve stripped all that out, I can reshape the data so I’ve got one row per museum per year, with a clear visitor type attached.

I’m going to start by cleaning:
- total visitors (Table 1)  
- visitors under 16 (Table 3)  
- overseas visitors (Table 4)

After that I’ll add an “Other” category by subtracting under 16s and overseas from the total.
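
Because the three sheets share a layout, the cleaning steps could also be wrapped in a single function. The sketch below is only an illustration: `clean_visitor_table` and the demo rows are made up, and it assumes the header row sits at raw index 4, as it does in the cells that follow. It uses `pd.to_numeric` for the counts instead of stripping characters, so anything non-numeric (such as `x`) becomes a missing value.

```python
import pandas as pd


def clean_visitor_table(raw, visitor_type, header_row=4):
    """Tidy one raw visitor sheet (read with header=None) into long format."""
    table = raw.iloc[header_row + 1:].copy()
    table.columns = raw.iloc[header_row].astype(str).str.strip()

    # Drop the free-text Notes column and rename the museum column
    table = table.drop(columns=[c for c in table.columns if "Notes" in c])
    table = table.rename(columns={"Name of museum or gallery": "Museum"})

    # Strip footnote markers such as "[note 2]" from museum names
    table["Museum"] = (
        table["Museum"].astype(str).str.replace(r"\[.*?\]", "", regex=True).str.strip()
    )

    long = table.melt(id_vars="Museum", var_name="Year", value_name="VisitorCount")

    # Strip footnote markers such as "[b]" from the year labels
    long["Year"] = (
        long["Year"].astype(str).str.replace(r"\[.*?\]", "", regex=True).str.strip()
    )

    # Non-numeric placeholders such as "x" become missing values
    long["VisitorCount"] = pd.to_numeric(
        long["VisitorCount"], errors="coerce"
    ).astype("Int64")
    long["VisitorType"] = visitor_type
    return long


# Made-up rows that mimic the sheet layout (four note rows, then the header)
raw_demo = pd.DataFrame([
    ["Table title", None, None, None],
    ["note", None, None, None],
    ["note", None, None, None],
    ["note", None, None, None],
    ["Name of museum or gallery", "2008/09", "2010/11 [b]", "Notes"],
    ["Museum A [note 2]", 100, "x", "some note"],
    ["Museum B", 200, 250, None],
])

demo_long = clean_visitor_table(raw_demo, "Total")
print(demo_long)
```

With a helper like this, each sheet would need one call, for example `clean_visitor_table(raw3, "Under 16")`. The cells below keep the steps written out so each one can be inspected.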


In [10]:
# --- Cleaning the Total Visitors table (Table 1) ---

# Load raw sheet with no header
raw1 = pd.read_excel(dcms_file, sheet_name="1", header=None)

# Row 4 contains the real header
header1 = raw1.iloc[4]

# Data starts from row 5 downward
table1 = raw1.iloc[5:].copy()

# Apply header
table1.columns = header1

# Drop Notes column
table1 = table1.drop(columns=[col for col in table1.columns if "Notes" in str(col)])

# Rename museum column
table1 = table1.rename(columns={"Name of museum or gallery": "Museum"})

# Remove footnote markers in square brackets from museum names
table1["Museum"] = (
    table1["Museum"]
    .astype(str)
    .str.replace(r"\[.*?\]", "", regex=True)
    .str.strip()
)

# Melt into long format
table1_long = table1.melt(
    id_vars="Museum",
    var_name="Year",
    value_name="VisitorCount"
)

# Keep Year as clean text
table1_long["Year"] = table1_long["Year"].astype(str).str.strip()

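# Strip non-digit characters (such as the "x" placeholder), then convert to nullable integers
# (this assumes every count in the sheet is a whole number)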
table1_long["VisitorCount"] = (
    table1_long["VisitorCount"]
    .astype(str)
    .str.replace(r"[^0-9]", "", regex=True)
    .replace("", np.nan)
    .astype("Int64")
)

# Add visitor type label
table1_long["VisitorType"] = "Total"

table1_long.head()


Unnamed: 0,Museum,Year,VisitorCount,VisitorType
0,British Museum,2008/09,5472056,Total
1,Museum of the Home,2008/09,86499,Total
2,Horniman Museum,2008/09,483113,Total
3,Imperial War Museums,2008/09,2006765,Total
4,Museum of Science and Industry in Manchester,2008/09,745188,Total


In [11]:
# --- Cleaning the Under-16 Visitors table (Table 3) ---

# Load sheet with no header
raw3 = pd.read_excel(dcms_file, sheet_name="3", header=None)

# Row 4 contains the proper header
header3 = raw3.iloc[4]

# Data begins from row 5
table3 = raw3.iloc[5:].copy()

# Apply header row
table3.columns = header3

# Drop Notes column if present
table3 = table3.drop(columns=[col for col in table3.columns if "Notes" in str(col)])

# Rename museum column
table3 = table3.rename(columns={"Name of museum or gallery": "Museum"})

# Remove bracketed notes from museum names
table3["Museum"] = (
    table3["Museum"]
    .astype(str)
    .str.replace(r"\[.*?\]", "", regex=True)
    .str.strip()
)

# Melt into long format
table3_long = table3.melt(
    id_vars="Museum",
    var_name="Year",
    value_name="VisitorCount"
)

# Clean Year values
table3_long["Year"] = table3_long["Year"].astype(str).str.strip()

# Clean VisitorCount before converting
table3_long["VisitorCount"] = (
    table3_long["VisitorCount"]
    .astype(str)
    .str.replace(r"[^0-9]", "", regex=True)
    .replace("", np.nan)
    .astype("Int64")
)

# Label visitor type
table3_long["VisitorType"] = "Under 16"

# Preview
table3_long.head()


Unnamed: 0,Museum,Year,VisitorCount,VisitorType
0,British Museum,2008/09,723592,Under 16
1,Museum of the Home,2008/09,21021,Under 16
2,Horniman Museum,2008/09,276951,Under 16
3,Imperial War Museums,2008/09,580439,Under 16
4,Museum of Science and Industry in Manchester,2008/09,298050,Under 16


In [12]:
# --- Cleaning Table 4: Overseas visitors ---

# load sheet with no header so pandas does not use the title row as column names
raw4 = pd.read_excel(dcms_file, sheet_name="4", header=None)

# row 4 is the header row
header4 = raw4.iloc[4]

# data begins on row 5
table4 = raw4.iloc[5:].copy()

# assign header row as column names
table4.columns = header4

# drop the notes column if present
table4 = table4.drop(columns=[col for col in table4.columns if "Notes" in str(col)])

# rename museum column
table4 = table4.rename(columns={"Name of museum or gallery": "Museum"})

# remove square-bracket notes from museum names
table4["Museum"] = (
    table4["Museum"]
    .astype(str)
    .str.replace(r"\[.*?\]", "", regex=True)
    .str.strip()
)

# reshape into long format
table4_long = table4.melt(
    id_vars="Museum",
    var_name="Year",
    value_name="VisitorCount"
)

# clean Year values
table4_long["Year"] = table4_long["Year"].astype(str).str.strip()

# clean VisitorCount: remove non-numbers, treat x as missing
table4_long["VisitorCount"] = (
    table4_long["VisitorCount"]
    .astype(str)
    .str.replace(r"[^0-9]", "", regex=True)
    .replace("", np.nan)
    .astype("Int64")
)

# assign visitor type
table4_long["VisitorType"] = "Overseas"

table4_long.head()


Unnamed: 0,Museum,Year,VisitorCount,VisitorType
0,British Museum,2008/09,3228234,Overseas
1,Museum of the Home,2008/09,9000,Overseas
2,Horniman Museum,2008/09,9092,Overseas
3,Imperial War Museums,2008/09,634702,Overseas
4,Museum of Science and Industry in Manchester,2008/09,745188,Overseas


### Step 1.1: Creating an “Other Visitors” category

The visitor numbers dataset includes totals, overseas visitors, and visitors under 16.  
To make the dataset more useful I am adding a fourth category: **Other Visitors**.  
This represents everyone who is not overseas and not under 16.

The value is calculated using:

**Other Visitors = Total Visitors − Under-16 Visitors − Overseas Visitors**

To do this I will temporarily reshape the three visitor tables into wide format, subtract the columns to create the new values, then convert everything back into long format so it matches the structure of the other cleaned visitor tables.
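
As a small worked example of the subtraction, here is a sketch on made-up figures for two hypothetical museums. It differs from the cell below in one respect: it does not fill missing values with zero, so a museum with a missing component gets a missing "Other" value instead of an inflated one. Which behaviour is right for the project is a judgement call.

```python
import pandas as pd

# Made-up figures for two hypothetical museums, in the same long layout
visitors = pd.DataFrame({
    "Museum": ["Museum A"] * 3 + ["Museum B"] * 3,
    "Year": ["2008/09"] * 6,
    "VisitorType": ["Total", "Under 16", "Overseas"] * 2,
    "VisitorCount": pd.array([1000, 200, 300, 500, None, 50], dtype="Int64"),
})

# Wide format: one row per museum and year, one column per visitor type
wide = visitors.pivot(
    index=["Museum", "Year"], columns="VisitorType", values="VisitorCount"
)

# No fillna(0) here: a missing component leaves "Other" missing as well
wide["Other"] = wide["Total"] - wide["Under 16"] - wide["Overseas"]

# Back to long format so it matches the other visitor tables
other_long = (
    wide["Other"]
    .rename("VisitorCount")
    .reset_index()
    .assign(VisitorType="Other")
)
print(other_long)
```

For Museum A this gives 1000 − 200 − 300 = 500, while Museum B stays missing because its under-16 figure is not known.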


In [14]:
# --- Create the Other Visitors category (Total - Overseas - Under 16) ---

# Step 1: Pivot each visitor table into wide format
total_wide = table1_long.pivot(
    index=["Museum", "Year"],
    columns="VisitorType",
    values="VisitorCount"
)

under16_wide = table3_long.pivot(
    index=["Museum", "Year"],
    columns="VisitorType",
    values="VisitorCount"
)

overseas_wide = table4_long.pivot(
    index=["Museum", "Year"],
    columns="VisitorType",
    values="VisitorCount"
)

# Step 2: Combine the tables
combined = (
    total_wide.join(under16_wide, how="left")
              .join(overseas_wide, how="left")
)

# Rename "Under 16" so the column name has no space
# ("Total" and "Overseas" keep their names)
combined = combined.rename(columns={"Under 16": "Under16"})

# Step 3: Replace missing values with zero before subtracting
# (where Under 16 or Overseas is missing, "Other" will be overstated)
combined = combined.fillna(0)

# Step 4: Calculate Other Visitors
combined["Other"] = (
    combined["Total"]
    - combined["Overseas"]
    - combined["Under16"]
)

# Step 5: Convert back to long format
other_visitors_long = (
    combined["Other"]
    .reset_index()
    .rename(columns={"Other": "VisitorCount"})
)

# Step 6: Label visitor type
other_visitors_long["VisitorType"] = "Other"

# Preview
other_visitors_long.head()


Unnamed: 0,Museum,Year,VisitorCount,VisitorType
0,British Museum,2008/09,1520230,Other
1,British Museum,2009/10,1275931,Other
2,British Museum,2010/11 [b],1373958,Other
3,British Museum,2011/12,1571646,Other
4,British Museum,2012/13,1112189,Other


In [16]:
# --- Final cleanup: normalise Year values across all visitor datasets ---

for df in [table1_long, table3_long, table4_long, other_visitors_long]:
    df["Year"] = (
        df["Year"]
        .astype(str)
        .str.replace(r"\[.*?\]", "", regex=True)  # remove anything in square brackets
        .str.strip()
    )


### Step 2: Cleaning the recommendation table

Table 6 contains the percentage of visitors who would recommend a visit. Its structure follows the same pattern as the other KPI sheets, with a block of metadata at the top, the real header row sitting further down, and the museum figures listed underneath.

For this step I am working with:

- **Table 6** for the proportion of visitors who would recommend a visit

The cleaning process mirrors the visitor tables. I'll load the sheet without a header, extract the correct header row, and remove any surrounding notes. The museum names need tidying to strip out bracketed notes, and the “x” placeholders are converted into missing values so they can be handled consistently. Footnote markers such as “[b]” are stripped from the year labels so that only the plain financial year remains.

Once the table is standardised, I can reshape it into long format so that each row represents one museum and one year, with a single percentage value. The final step is to convert these cleaned values into numeric form ready for analysis.
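
The numeric conversion can be sketched on a few made-up values. This assumes the percentages arrive as decimals (0.85 for 85%), which is what the cell below relies on. Rounding before the integer conversion matters because a floating-point product can land just below a whole number.

```python
import pandas as pd

# Made-up values as they might come out of the sheet: decimals plus an "x" placeholder
raw_values = pd.Series([0.85, "x", 0.97, 0.29])

# Coerce placeholders to NaN, scale to percent, round, then store as nullable integers
percent = (
    (pd.to_numeric(raw_values, errors="coerce") * 100)
    .round()
    .astype("Int64")
)
print(percent)
```

The `Int64` dtype (capital I) keeps the missing value as `<NA>` while the rest stay whole numbers.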


In [17]:
# --- Cleaning the Recommendation Percentages table (Table 6) ---

raw6 = pd.read_excel(dcms_file, sheet_name="6", header=None)

# row 4 contains the real header
header6 = raw6.iloc[4].astype(str).str.strip()

# table body begins at row 5
table6 = raw6.iloc[5:].copy()
table6.columns = header6

# drop the Notes column
table6 = table6.drop(columns=[c for c in table6.columns if "Notes" in str(c)])

# clean museum names
table6["Name of museum or gallery"] = (
    table6["Name of museum or gallery"]
    .astype(str)
    .str.replace(r"\[.*?\]", "", regex=True)
    .str.strip()
)

# replace x with NaN
table6 = table6.replace("x", np.nan)

# melt to long form
table6_long = table6.melt(
    id_vars="Name of museum or gallery",
    var_name="Year",
    value_name="RecommendPercent"
)

table6_long = table6_long.rename(columns={"Name of museum or gallery": "Museum"})

# Convert to numeric safely
table6_long["RecommendPercent"] = pd.to_numeric(table6_long["RecommendPercent"], errors="coerce")

# Excel stores percentages as decimals (e.g., 0.85 = 85%) so multiply by 100
table6_long["RecommendPercent"] = (table6_long["RecommendPercent"] * 100).round()

# Convert to integer type
table6_long["RecommendPercent"] = table6_long["RecommendPercent"].astype("Int64")

# clean Year values (remove [b], [note], etc.)
table6_long["Year"] = (
    table6_long["Year"]
    .astype(str)
    .str.replace(r"\[.*?\]", "", regex=True)
    .str.strip()
)

table6_long.head()


Unnamed: 0,Museum,Year,RecommendPercent
0,British Museum,2008/09,85
1,Museum of the Home,2008/09,86
2,Horniman Museum,2008/09,97
3,Imperial War Museums,2008/09,98
4,Museum of Science and Industry in Manchester,2008/09,78


### Step 3: Cleaning the income tables

The income information is split across three separate sheets, each covering a different type of revenue reported by the museums. The structure is broadly the same as the visitor tables, with a few rows of metadata at the top, a proper header row partway down, and the museum figures listed underneath.
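
The header row does not sit at the same position in every sheet: the cells below use raw index 3 for Table 10 and raw index 4 for Tables 11 and 12. One way to avoid hard-coding the index is to search for the header label. This is a sketch only: `find_header_row` is a hypothetical helper, the demo rows are made up, and it assumes the label appears in the first column.

```python
import pandas as pd


def find_header_row(raw, label="Name of museum or gallery"):
    """Return the index of the first row whose first column equals `label`."""
    first_col = raw.iloc[:, 0].astype(str).str.strip()
    matches = first_col[first_col == label]
    if matches.empty:
        raise ValueError(f"No row starts with {label!r}")
    return matches.index[0]


# Made-up sheet: three note rows, then the header, then data
raw_demo = pd.DataFrame([
    ["Table title", None],
    ["note", None],
    ["note", None],
    ["Name of museum or gallery", "2008/09"],
    ["Museum A", 100],
])

header_idx = find_header_row(raw_demo)
print(header_idx)
```

The result could then replace the fixed numbers, for example `raw.iloc[header_idx]` for the header and `raw.iloc[header_idx + 1:]` for the data.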

For this step I am working with:

- **Table 10** for admissions income  
- **Table 11** for trading income  
- **Table 12** for fundraising income  

Each table needs the same treatment: load it without a header, identify the correct header row, remove any surrounding notes, and clean up the museum names. The bracketed year columns need to be dropped so only the real financial years remain. After that I can reshape each table into long format so that every row represents one museum and one year, with a single income value. The final step is to convert the cleaned income values into numeric form and label each table with its income type.

Once all three have been cleaned in the same way, they can be combined into one standardised income dataset.
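
The column selection and value cleaning can be sketched on a made-up table. The example assumes year columns look like `2008/09`, so matching that pattern leaves out both the footnoted columns and the Notes column in one go. `YEAR_PATTERN` and `clean_income_values` are illustrative names, not part of the notebook.

```python
import re

import pandas as pd

# Plain financial years only, e.g. "2008/09" (not "2010/11 [b]" or "Notes")
YEAR_PATTERN = re.compile(r"^\d{4}/\d{2}$")


def clean_income_values(values):
    """Turn text such as '1,200', 'Nil' or 'x' into nullable integers."""
    cleaned = (
        values.astype(str)
        .str.replace(",", "", regex=False)  # drop thousands separators
        .str.strip()
        .replace({"Nil": "0"})              # Nil means zero income
    )
    # Anything still non-numeric (x, N/A) becomes missing
    return pd.to_numeric(cleaned, errors="coerce").round(0).astype("Int64")


# Made-up income table in the same shape as the cleaned sheets
income_demo = pd.DataFrame({
    "Museum": ["Museum A", "Museum B"],
    "2008/09": ["1,200", "x"],
    "2009/10": ["-1190", "Nil"],
    "2010/11 [b]": ["300", "400"],
    "Notes": ["[note 2] revised", None],
})

year_cols = [c for c in income_demo.columns if YEAR_PATTERN.match(c)]

income_long = income_demo.melt(
    id_vars="Museum", value_vars=year_cols, var_name="Year", value_name="Income"
)
income_long["Income"] = clean_income_values(income_long["Income"])
print(income_long)
```

Using `pd.to_numeric` keeps minus signs intact, which matters because trading income can be negative.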


In [18]:
# --- Cleaning Admissions Income (Table 10) ---

import pandas as pd
import numpy as np

# Load sheet 10 with no header
raw10 = pd.read_excel(
    dcms_file,
    sheet_name="10",
    header=None,
    dtype=str
)

# Header is on row index 3
header10 = raw10.iloc[3].astype(str).str.strip()

# Data starts on row index 4
table10 = raw10.iloc[4:].copy()
table10.columns = header10

# Standardise column name
table10 = table10.rename(columns={"Name of museum or gallery": "Museum"})

# Remove [notes] from museum names
table10["Museum"] = (
    table10["Museum"]
    .astype(str)
    .str.replace(r"\[.*?\]", "", regex=True)
    .str.strip()
)

# Convert "x" and "N/A" to missing
table10 = table10.replace({"x": np.nan, "N/A": np.nan})

# Keep only real year columns (skip Notes and any footnoted columns)
year_cols_10 = [
    col for col in table10.columns
    if col != "Museum" and "Notes" not in col and "[" not in col
]

# Reshape long
table10_long = table10.melt(
    id_vars="Museum",
    value_vars=year_cols_10,
    var_name="Year",
    value_name="Income"
)

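# Clean numeric values: drop thousands separators, keep signs and decimals
table10_long["Income"] = (
    table10_long["Income"]
    .astype(str)
    .str.replace(",", "", regex=False)
    .str.strip()
)

# Convert to nullable integer (anything non-numeric becomes missing)
table10_long["Income"] = (
    pd.to_numeric(table10_long["Income"], errors="coerce")
    .round(0)
    .astype("Int64")
)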

# Label income type
table10_long["IncomeType"] = "Admissions"

# clean Year values (remove [b], [note], etc.)
table10_long["Year"] = (
    table10_long["Year"]
    .astype(str)
    .str.replace(r"\[.*?\]", "", regex=True)
    .str.strip()
)

table10_long.head()


Unnamed: 0,Museum,Year,Income,IncomeType
0,British Museum,2008/09,3100000.0,Admissions
1,Museum of the Home,2008/09,,Admissions
2,Horniman Museum,2008/09,5809.0,Admissions
3,Imperial War Museums,2008/09,4752000.0,Admissions
4,Museum of Science and Industry in Manchester,2008/09,1541874.0,Admissions


In [19]:
# --- Cleaning Trading Income (Table 11) ---

# Read the sheet with all values as strings to avoid Excel auto-conversion
raw11 = pd.read_excel(
    dcms_file,
    sheet_name="11",
    header=None,
    dtype=str
)

# Header row ("Name of museum or gallery") is at raw index 4 (0-based)
header11 = raw11.iloc[4].astype(str).str.strip()

# Build a proper dataframe
table11 = raw11.iloc[5:].copy()
table11.columns = header11

# Drop rows where the museum name is missing
table11 = table11[table11["Name of museum or gallery"].notna()].copy()

# Rename museum column
table11 = table11.rename(columns={"Name of museum or gallery": "Museum"})

# Clean museum names: remove bracketed notes, trim whitespace
table11["Museum"] = (
    table11["Museum"]
    .astype(str)
    .str.replace(r"\[.*?\]", "", regex=True)
    .str.strip()
)

# Identify real year columns (exclude Notes and any containing "[")
year_cols_11 = [
    col for col in table11.columns
    if col != "Museum" and "Notes" not in str(col) and "[" not in str(col)
]

# Reshape to long format
table11_long = table11.melt(
    id_vars="Museum",
    value_vars=year_cols_11,
    var_name="Year",
    value_name="Income"
)

# Clean income values
table11_long["Income"] = (
    table11_long["Income"]
    .astype(str)
    .str.replace(",", "", regex=False)        # remove commas
    .str.replace("x", "", regex=False)        # blank out x
    .str.replace("N/A", "", regex=False)      # blank out N/A
    .str.replace("Nil", "0", regex=False)     # Nil should be 0
    .str.strip()
)

# Convert to integers safely
table11_long["Income"] = (
    pd.to_numeric(table11_long["Income"], errors="coerce")
    .round(0)
    .astype("Int64")
)

# Label this as Trading income
table11_long["IncomeType"] = "Trading"

# clean Year values (remove [b], [note], etc.)
table11_long["Year"] = (
    table11_long["Year"]
    .astype(str)
    .str.replace(r"\[.*?\]", "", regex=True)
    .str.strip()
)

# Preview
table11_long.head()


Unnamed: 0,Museum,Year,Income,IncomeType
0,British Museum,2008/09,3600000,Trading
1,Museum of the Home,2008/09,-1190,Trading
2,Horniman Museum,2008/09,65600,Trading
3,Imperial War Museums,2008/09,4129000,Trading
4,Museum of Science and Industry in Manchester,2008/09,332532,Trading


In [20]:
# --- Cleaning Fundraising Income (Table 12) ---

# Read the sheet with all values as text
raw12 = pd.read_excel(
    dcms_file,
    sheet_name="12",
    header=None,
    dtype=str
)

# Header row is at raw index 4 (0-based)
header12 = raw12.iloc[4].astype(str).str.strip()

# Build table starting from row 5 onwards
table12 = raw12.iloc[5:].copy()
table12.columns = header12

# Drop blank museum rows
table12 = table12[table12["Name of museum or gallery"].notna()].copy()

# Rename museum column
table12 = table12.rename(columns={"Name of museum or gallery": "Museum"})

# Clean museum names
table12["Museum"] = (
    table12["Museum"]
    .astype(str)
    .str.replace(r"\[.*?\]", "", regex=True)
    .str.strip()
)

# Identify valid year columns (exclude Notes and bracketed ones such as '2010/11 [b]')
year_cols_12 = [
    col for col in table12.columns
    if col != "Museum" and "Notes" not in str(col) and "[" not in str(col)
]

# Reshape to long format
table12_long = table12.melt(
    id_vars="Museum",
    value_vars=year_cols_12,
    var_name="Year",
    value_name="Income"
)

# Clean numeric values
table12_long["Income"] = (
    table12_long["Income"]
    .astype(str)
    .str.replace(",", "", regex=False)     # remove thousands commas
    .str.replace("x", "", regex=False)     # remove x markers
    .str.replace("N/A", "", regex=False)   # remove N/A
    .str.replace("Nil", "0", regex=False)  # convert Nil → 0
    .str.strip()
)

# Convert to nullable integer
table12_long["Income"] = (
    pd.to_numeric(table12_long["Income"], errors="coerce")
    .round(0)
    .astype("Int64")
)

# Label income type
table12_long["IncomeType"] = "Fundraising"

# clean Year values (remove [b], [note], etc.)
table12_long["Year"] = (
    table12_long["Year"]
    .astype(str)
    .str.replace(r"\[.*?\]", "", regex=True)
    .str.strip()
)

# Preview
table12_long.head()


Unnamed: 0,Museum,Year,Income,IncomeType
0,British Museum,2008/09,8000000,Fundraising
1,Museum of the Home,2008/09,52953,Fundraising
2,Horniman Museum,2008/09,398321,Fundraising
3,Imperial War Museums,2008/09,3496000,Fundraising
4,Museum of Science and Industry in Manchester,2008/09,147000,Fundraising


### Step 4: Creating the final clean KPI datasets

With all of the individual KPI tables now cleaned and standardised, the next stage is to organise them into three final datasets ready for analysis. Each dataset is kept in a consistent long format: one row per museum per year for the recommendation data, and one row per museum, year and category (visitor type or income type) for the other two.

For this step I am creating:

- **A visitor dataset** combining the cleaned Total, Under 16, Overseas, and Other Visitor tables  
- **A recommendation dataset** containing the cleaned Recommend a Visit percentages  
- **An income dataset** combining the cleaned Admissions, Trading, and Fundraising income tables  


Each dataset is then exported as a separate CSV with clear names showing that the contents have been fully cleaned:

- `kpi_visitors_clean.csv`  
- `kpi_recommend_clean.csv`
- `kpi_income_clean.csv`  
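
Before relying on the exported files, a quick check of the keys and a round trip through CSV can be useful. The sketch below uses made-up rows and an in-memory buffer in place of a real file, and it assumes the counts should come back as nullable integers when the CSV is read again.

```python
import io

import pandas as pd

# Made-up rows in the same layout as the visitor dataset
demo = pd.DataFrame({
    "Museum": ["Museum A", "Museum A", "Museum B"],
    "Year": ["2008/09", "2009/10", "2008/09"],
    "VisitorType": ["Total", "Total", "Total"],
    "VisitorCount": pd.array([1000, None, 500], dtype="Int64"),
})

# Each museum / year / type combination should appear only once
has_duplicates = demo.duplicated(subset=["Museum", "Year", "VisitorType"]).any()

# Write to an in-memory buffer (stand-in for the CSV file) and read it back
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)

# Asking for Int64 on load keeps missing counts as <NA> without turning the column into floats
reloaded = pd.read_csv(buffer, dtype={"VisitorCount": "Int64"})
print(has_duplicates)
print(reloaded.dtypes)
```

Without the `dtype` argument, a column containing blanks would be read back as floats, so anyone loading the files later may want to pass it too.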



In [22]:
# --- Create and save CLEAN VISITOR DATASET ---

# Combine all cleaned visitor tables:
# - table1_long (Total Visitors)
# - table3_long (Under 16 Visitors)
# - table4_long (Overseas Visitors)
# - other_visitors_long (Other Visitors)

visitors_clean = pd.concat(
    [table1_long, table3_long, table4_long, other_visitors_long],
    ignore_index=True
)

# Save to CSV
visitors_clean.to_csv(
    r"..\data\kpi_visitors_clean.csv",
    index=False
)

visitors_clean.head()


Unnamed: 0,Museum,Year,VisitorCount,VisitorType
0,British Museum,2008/09,5472056,Total
1,Museum of the Home,2008/09,86499,Total
2,Horniman Museum,2008/09,483113,Total
3,Imperial War Museums,2008/09,2006765,Total
4,Museum of Science and Industry in Manchester,2008/09,745188,Total


In [24]:
# --- Create and save CLEAN RECOMMENDATION DATASET ---

# table6_long already contains:
# - Museum
# - Year
# - RecommendPercent
# and is fully cleaned

recommend_clean = table6_long.copy()

# Save to CSV
recommend_clean.to_csv(
    r"..\data\kpi_recommend_clean.csv",
    index=False
)

recommend_clean.head()


Unnamed: 0,Museum,Year,RecommendPercent
0,British Museum,2008/09,85
1,Museum of the Home,2008/09,86
2,Horniman Museum,2008/09,97
3,Imperial War Museums,2008/09,98
4,Museum of Science and Industry in Manchester,2008/09,78


In [25]:
# --- Create and save CLEAN INCOME DATASET ---

# Combine all three income tables
income_clean = pd.concat(
    [table10_long, table11_long, table12_long],
    ignore_index=True
)

# Save to CSV
income_clean.to_csv(
    r"..\data\kpi_income_clean.csv",
    index=False
)

income_clean.head()


Unnamed: 0,Museum,Year,Income,IncomeType
0,British Museum,2008/09,3100000.0,Admissions
1,Museum of the Home,2008/09,,Admissions
2,Horniman Museum,2008/09,5809.0,Admissions
3,Imperial War Museums,2008/09,4752000.0,Admissions
4,Museum of Science and Industry in Manchester,2008/09,1541874.0,Admissions


### Summary

The three KPI datasets are now cleaned, standardised, and saved.  
Each one follows the same long format, with consistent museum names, tidy year values, and numeric fields converted to usable types. The visitor file brings together the Total, Under-16, Overseas, and Other categories; the income file includes all three income streams; and the recommendation file contains the final percentage scores.

All three outputs are saved in the project’s data folder as:

- `kpi_visitors_clean.csv`  
- `kpi_income_clean.csv`  
- `kpi_recommend_clean.csv`

These files are now ready to use for analysis and visualisation.
