Design and document a sequence of operations to convert a pair of source files into an output file that can be used in subsequent analysis.

Implement your design.

Test your pipeline with an appropriate test deck, and document your test process and results.

You have been provided with two zipped files in the folder PartA Data.

The first contains data about the time of sunrise, sunset and the length of the day in Edinburgh. The second contains weather data from a weather station in Scotland.

Your brief is to estimate the temperature for every minute of the year 2012. The estimate will feed a further analysis of operational data from electronic equipment in the vicinity. You may use any combination of the technologies you have been introduced to, together with any tools you are already familiar with.

This lab does not give you any detailed information.
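
One possible reading of the brief is to upsample coarser temperature readings to a one-minute grid and interpolate between them. The sketch below illustrates that idea only: the hourly values, the series name `temperature_c`, and the choice of linear interpolation are all assumptions, since the layout of the weather file is not described here.

```python
import pandas as pd

# Hypothetical hourly readings; the real weather file's layout is not shown here.
hourly = pd.Series(
    [4.0, 5.0, 7.0],
    index=pd.to_datetime(
        ["2012-01-01 00:00", "2012-01-01 01:00", "2012-01-01 02:00"]
    ),
    name="temperature_c",
)

# Upsample to one row per minute, then fill the new rows by
# time-weighted linear interpolation between the known readings.
per_minute = hourly.resample("1min").interpolate(method="time")

print(len(per_minute))
print(per_minute.loc[pd.Timestamp("2012-01-01 00:30")])

# The target grid for the whole year: 2012 is a leap year,
# so it has 366 * 1440 one-minute slots.
full_year = pd.date_range("2012-01-01 00:00", "2012-12-31 23:59", freq="1min")
print(len(full_year))
```

Linear interpolation is only one option. Sunrise and sunset times could instead shape a daily temperature curve, which is presumably why the daytime file is supplied.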

In [13]:
import pandas as pd
import numpy as np
import datetime as dt
import re
import os

In [2]:
EXCEPTIONS = ["", " ", "-", "--", "NaN", "NULL", "None"]
TIME_PATTERN = r"\b\d{1,2}:\d{2}(?::\d{2})?\b"


In [3]:
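def parse_time(value):
    """Return the first HH:MM or HH:MM:SS time found in value, or NaN if there is none."""
    # Non-string cells (numbers, NaN) cannot contain a time string.
    if not isinstance(value, str):
        return np.nan

    match = re.search(TIME_PATTERN, value)
    if not match:
        return np.nan

    time_str = match.group(0)

    # Try the longer format first; strptime raises ValueError on a format
    # mismatch or on out-of-range values such as "25:99".
    for fmt in ("%H:%M:%S", "%H:%M"):
        try:
            return dt.datetime.strptime(time_str, fmt).time()
        except ValueError:
            continue

    return np.nan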
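
To show what the time pattern picks out of a messy cell, here is a small standalone check. It uses a condensed stand-in named `extract_time` (a hypothetical helper, not the notebook's function) and sample strings that are assumed for illustration rather than taken from the spreadsheet.

```python
import re
import datetime as dt

# Same pattern as TIME_PATTERN in the notebook.
TIME_PATTERN = r"\b\d{1,2}:\d{2}(?::\d{2})?\b"


def extract_time(raw):
    """Return a datetime.time found in raw, or None if there is none."""
    match = re.search(TIME_PATTERN, raw)
    if match is None:
        return None
    text = match.group(0)
    # Two colons means seconds are present.
    fmt = "%H:%M:%S" if text.count(":") == 2 else "%H:%M"
    try:
        return dt.datetime.strptime(text, fmt).time()
    except ValueError:
        # The regex accepts digit patterns that are not valid clock times.
        return None


# Hypothetical cell values, assumed for illustration.
samples = ["08:43 ↑ (133°)", "7:04:49", "no time here", "25:99"]
results = [extract_time(s) for s in samples]
for raw, parsed in zip(samples, results):
    print(repr(raw), "->", parsed)
```

The last sample matters: the regular expression only checks the digit layout, so range validation is left to `strptime`.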

In [4]:
def clean_column_names(df):
    """Flatten MultiIndex headers into single string names."""
    df.columns = [f"{top} {sub}".strip() for top, sub in df.columns]
    return df
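
When a two-row header has blank cells, pandas fills the missing level with a placeholder that starts with `Unnamed:`, and a plain join would carry that text into the flattened name. The sketch below shows one way to skip such placeholders. The header labels and the helper name `flatten` are hypothetical; whether the real spreadsheet has blank header cells is not known from this notebook.

```python
import pandas as pd

# Hypothetical two-row header; the spreadsheet's real labels may differ.
columns = pd.MultiIndex.from_tuples(
    [
        ("2012", "Jan"),
        ("Sunrise/Sunset", "Sunrise"),
        ("Daylength", "Unnamed: 2_level_1"),  # placeholder for a blank cell
    ]
)
demo = pd.DataFrame([[1, "08:43", "7:04:49"]], columns=columns)


def flatten(col):
    """Join header levels, skipping pandas' 'Unnamed: ...' placeholders."""
    parts = [str(p).strip() for p in col if not str(p).startswith("Unnamed:")]
    return " ".join(parts)


demo.columns = [flatten(col) for col in demo.columns]
print(list(demo.columns))
```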

In [5]:
def remove_symbols(df):
    """Remove arrows, degree signs, and brackets."""
    return df.replace(r"[↑↓()°]", "", regex=True)
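
A quick check of the character-class replacement on a toy frame makes its side effect visible: removing the symbols leaves the surrounding spaces behind. The cell values below are assumed for illustration, and the whitespace-collapsing step is an optional extra, not part of the notebook.

```python
import pandas as pd

# Hypothetical cell values, assumed for illustration.
demo = pd.DataFrame(
    {"Sunrise": ["08:43 ↑ (133°)"], "Sunset": ["15:49 ↓ (227°)"]}
)

# Same character class as remove_symbols: arrows, brackets, degree sign.
cleaned = demo.replace(r"[↑↓()°]", "", regex=True)
print(cleaned.iloc[0].tolist())

# Optional extra step: collapse the runs of spaces left behind.
tidy = cleaned.replace(r"\s+", " ", regex=True)
print(tidy.iloc[0].tolist())
```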

In [6]:
df = pd.read_excel("data/Edinburgh-daytime.xlsx", header=[0, 1])
df = clean_column_names(df)


In [7]:
# Remove symbolic characters
df = remove_symbols(df)

# Convert solar distance to numeric
if "Solar Noon Mil. km" in df.columns:
    df["Solar Noon Mil. km"] = pd.to_numeric(df["Solar Noon Mil. km"], errors="coerce")

# Standardize blank or bad values
df = df.replace(EXCEPTIONS, np.nan)
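
Passing a list to `replace` matches whole cell values only, so a marker padded with spaces is not caught. The sketch below demonstrates this on made-up values and shows one possible workaround, stripping whitespace first. Whether the real data contains padded markers is an assumption.

```python
import numpy as np
import pandas as pd

# Same list as EXCEPTIONS in the notebook.
EXCEPTIONS = ["", " ", "-", "--", "NaN", "NULL", "None"]

# Made-up values, assumed for illustration.
demo = pd.DataFrame({"value": ["12.5", "-", " -- ", "None"]})

# Whole-cell matching: the padded " -- " is left untouched.
exact = demo.replace(EXCEPTIONS, np.nan)
print(exact["value"].isna().tolist())

# Stripping surrounding whitespace first lets the padded marker match too.
stripped = demo.apply(lambda col: col.str.strip()).replace(EXCEPTIONS, np.nan)
print(stripped["value"].isna().tolist())
```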


In [14]:
output_path = "data/Edinburgh-daytime-cleaned.csv"
os.makedirs(os.path.dirname(output_path), exist_ok=True)

df.to_csv(output_path, index=False)
print(f"Cleaned dataset written to: {output_path}")

Cleaned dataset written to: data/Edinburgh-daytime-cleaned.csv
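
As part of documenting the test process, the written file can be read back and compared with the frame in memory. The sketch below does this on a tiny made-up frame in a temporary directory, so it does not touch the real output file. The column names are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Tiny made-up frame standing in for the cleaned data.
demo = pd.DataFrame({"Day": [1, 2], "Sunrise": ["08:43", None]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "roundtrip.csv")
    demo.to_csv(path, index=False)
    restored = pd.read_csv(path)

# Shape, column names, and missing-value count should survive the round trip.
print(restored.shape == demo.shape)
print(list(restored.columns) == list(demo.columns))
print(int(restored["Sunrise"].isna().sum()))
```

Note that times come back from CSV as plain strings, so any time parsing has to be reapplied after reading the file.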
