Design and document a sequence of operations to convert a pair of source files into an output file that can be used in subsequent analysis.

Implement your design.

Test your pipeline with an appropriate test deck, and document your test process and results.

You have been provided with two zipped files in the folder PartA Data.

The first contains data about the time of sunrise, sunset and the length of the day in Edinburgh. The second contains weather data from a weather station in Scotland.

Your brief is to estimate the temperature every minute for the year 2012. This will be used in a further analysis of operational data of electronic equipment in the vicinity. You may use any combination of the technologies you have been introduced to together with any tools that you are already familiar with.

This lab does not give you any detailed information.
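
As a rough sketch of what "every minute for the year 2012" implies, the block below builds the per-minute grid and fills it by linear interpolation in time. The three hourly readings are invented for illustration; the real values would come from the weather-station file, whose layout is not shown here. Linear interpolation is only one possible estimator, not a method the brief prescribes.

```python
import pandas as pd

# Every minute of 2012 (a leap year, so 366 days * 1440 minutes).
minute_index = pd.date_range("2012-01-01 00:00", "2012-12-31 23:59", freq="min")

# Hypothetical hourly readings in degrees Celsius (made-up values).
hourly = pd.Series(
    [2.0, 4.0, 3.0],
    index=pd.to_datetime(
        ["2012-01-01 00:00", "2012-01-01 01:00", "2012-01-01 02:00"]
    ),
)

# Align the readings to the minute grid, then interpolate linearly in time.
# limit_area="inside" leaves minutes outside the observed range as NaN
# instead of extrapolating.
per_minute = (
    hourly.reindex(hourly.index.union(minute_index))
    .interpolate(method="time", limit_area="inside")
    .reindex(minute_index)
)
```

Keeping the unobserved minutes as NaN makes gaps in the source data visible in the output instead of silently filling them.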

In [1]:
import pandas as pd
import numpy as np

In [2]:
EXCEPTIONS = [
    "", " ", "-", "--", "NaN", "NULL", "None",
    "↑", "↓"
]

In [3]:
# Use the first *two* rows as a multi-level header
df = pd.read_excel("data/Edinburgh-daytime.xlsx", header=[0, 1])

# Flatten the MultiIndex columns into simple strings
df.columns = [
    f"{top} {sub}".strip()
    for top, sub in df.columns
]
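
To see what the flattening step produces, here is a small stand-alone sketch on a toy frame. The header labels mirror the flattened names used later in this notebook, but the frame and its values are made up and are not read from the spreadsheet.

```python
import pandas as pd

# Toy frame with a two-level header (illustrative values only).
columns = pd.MultiIndex.from_tuples(
    [
        ("Sunrise/Sunset", "Sunrise"),
        ("Sunrise/Sunset", "Sunset"),
        ("Daylength", "Length"),
    ]
)
toy = pd.DataFrame([["08:18", "15:44", "07:25:45"]], columns=columns)

# Join the two header levels with a space, as in the cell above.
toy.columns = [f"{top} {sub}".strip() for top, sub in toy.columns]
flat_names = toy.columns.tolist()
```

After this step every column is addressed by its full "group sub-header" name, so lookups by a bare sub-header such as "Sunrise" will not match.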




In [4]:
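# NOTE: after flattening, column names carry their group prefix
# (e.g. "Sunrise/Sunset Sunrise"), so none of these bare names is expected
# to match and this loop does nothing. The time columns are parsed with
# parse_time() further down.
time_columns = [
    "Sunrise", "Sunset", "Length", "Start", "End", "Time"
]

for col in time_columns:
    if col in df.columns:
        # errors="coerce" yields NaT for unparseable cells; errors="ignore"
        # is deprecated in recent pandas and would break the .dt accessor
        df[col] = pd.to_datetime(df[col], errors="coerce").dt.time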


In [5]:
df = df.replace(r"[↑↓()°]", "", regex=True)
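
The effect of this character-class replacement can be checked on a couple of sample cells. The raw format used here (time, direction arrow, azimuth in parentheses) is an assumption inferred from the cleaned values printed later, not something confirmed from the spreadsheet.

```python
import pandas as pd

# Assumed raw cell format: "HH:MM <arrow> (<azimuth>°)".
raw = pd.DataFrame(
    {"Sunrise/Sunset Sunrise": ["08:18 ↑ (130°)", "08:20 ↑ (130°)"]}
)

# Same pattern as the cell above: drop arrows, parentheses and degree signs.
stripped = raw.replace(r"[↑↓()°]", "", regex=True)

# Removing the arrow leaves a double space between the time and the azimuth.
first_value = stripped.loc[0, "Sunrise/Sunset Sunrise"]
```

Note that the time and the azimuth still share one cell after this step, so a further split is needed before either can be used numerically.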

In [6]:
df["Solar Noon Mil. km"] = pd.to_numeric(df["Solar Noon Mil. km"], errors="coerce")

In [7]:
df = df.replace(EXCEPTIONS, np.nan)

In [8]:
df_clean = df.dropna(subset=[
    "Sunrise/Sunset Sunrise",
    "Sunrise/Sunset Sunset",
    "Daylength Length"
])

In [9]:
import datetime as dt
import re

def parse_time(value):
    """
    Safely parse 'HH:MM' or 'HH:MM:SS' and ignore everything else.
    Returns a datetime.time or NaN.
    """
    if not isinstance(value, str):
        return np.nan

    # Extract hh:mm or hh:mm:ss from the string
    match = re.search(r"\b\d{1,2}:\d{2}(?::\d{2})?\b", value)
    if not match:
        return np.nan

    time_str = match.group(0)

    # Try HH:MM:SS
    try:
        return dt.datetime.strptime(time_str, "%H:%M:%S").time()
    except ValueError:
        pass

    # Try HH:MM
    try:
        return dt.datetime.strptime(time_str, "%H:%M").time()
    except ValueError:
        return np.nan
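
As an alternative to the row-by-row function above, the time and the trailing azimuth can be separated in one vectorised pass. This is a sketch under the assumption that cleaned cells look like "08:18  130"; the names `cells`, `parts`, `times` and `azimuths` are illustrative and do not appear elsewhere in this notebook.

```python
import pandas as pd

# Sample cleaned cells, plus a missing value and a non-matching string.
cells = pd.Series(["08:18  130", "15:44  230", None, "n/a"])

# Named groups become column names; rows that do not match yield NaN.
parts = cells.str.extract(r"(?P<time>\d{1,2}:\d{2})\s+(?P<azimuth>\d+)")

# Convert each part to a proper type; unparseable entries become NaT / NaN.
times = pd.to_datetime(parts["time"], format="%H:%M", errors="coerce").dt.time
azimuths = pd.to_numeric(parts["azimuth"], errors="coerce")
```

This variant keeps the azimuth, which the function above discards, at the cost of handling only the "HH:MM" form.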

In [10]:
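# Continue from the filtered frame built in cell 8; copying df here would
# silently discard the dropna() result.
df_clean = df_clean.copy()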

time_cols = [
    "Sunrise/Sunset Sunrise",
    "Sunrise/Sunset Sunset",
    "Astronomical Twilight Start",
    "Astronomical Twilight End",
    "Nautical Twilight Start",
    "Nautical Twilight End",
    "Civil Twilight Start",
    "Civil Twilight End",
    "Solar Noon Time"
]

for col in time_cols:
    if col in df_clean.columns:
        df_clean[col] = df_clean[col].apply(parse_time)
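
The "Daylength Length" column is not in the list above and holds durations, not clock times. One possible treatment, shown here only as a sketch on made-up sample values, is to convert it with `pd.to_timedelta` so that day length can be used arithmetically.

```python
import pandas as pd

# Sample duration strings in the "HH:MM:SS" form, plus one missing value.
lengths = pd.Series(["07:25:45", "07:23:13", None])

# Missing or unparseable entries become NaT.
day_length = pd.to_timedelta(lengths, errors="coerce")

# Day length expressed in minutes, convenient for per-minute modelling.
day_minutes = day_length.dt.total_seconds() / 60
```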

In [11]:
#print(df.columns.tolist())
print(df.head(10))
#print(df.describe(include='all'))

   2011 Dec Sunrise/Sunset Sunrise Sunrise/Sunset Sunset Daylength Length  \
0       NaN                    NaN                   NaN              NaN   
1       1.0             08:18  130            15:44  230         07:25:45   
2       2.0             08:20  130            15:43  230         07:23:13   
3       3.0             08:21  130            15:42  229         07:20:48   
4       4.0             08:23  131            15:41  229         07:18:29   
5       5.0             08:24  131            15:41  229         07:16:17   
6       6.0             08:26  131            15:40  229         07:14:11   
7       7.0             08:27  132            15:40  228         07:12:12   
8       8.0             08:29  132            15:39  228         07:10:20   
9       9.0             08:30  132            15:39  228         07:08:36   

  Daylength Difference Astronomical Twilight Start Astronomical Twilight End  \
0                  NaN                         NaN                      