# Gathering the Data: 

The first step is to collect the relevant data. We gather one week of GPS navigation (RINEX) files and write their contents to CSV. We use GPS Week 1997, which runs from 4/15/2018 to 4/21/2018 (day of year 105 to 111). If this proof of concept works, data retrieval will need further automation later 🚧.
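
The week-to-date mapping quoted above can be checked with a short sketch. It assumes full (non-rolled-over) GPS week numbers counted from 6 January 1980; the function name `gps_week_days` is hypothetical and not part of the notebook.

```python
from datetime import date, timedelta

# GPS time starts on Sunday, 6 January 1980 (week 0).
GPS_EPOCH = date(1980, 1, 6)


def gps_week_days(week):
    """Return the seven calendar dates covered by a full GPS week number."""
    start = GPS_EPOCH + timedelta(weeks=week)
    return [start + timedelta(days=offset) for offset in range(7)]


week_1997 = gps_week_days(1997)
for d in week_1997:
    # tm_yday is the day of year used in the daily RINEX folder names
    print(d.isoformat(), f"{d.timetuple().tm_yday:03d}")
```

The first and last printed lines give the week's boundaries as calendar dates and days of year.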

RINEX Version: [Link to RINEX 3.02 PDF](https://files.igs.org/pub/data/format/rinex302.pdf)

See A18

## Step 0A: Gather the Data

In [None]:
import os 
import chardet
import pandas as pd

In [None]:
def process_rnx_to_csv(year: str, day_of_year: str):
    base_dir = os.path.dirname(__file__) if "__file__" in globals() else os.getcwd()
    rnx_folder = os.path.join(base_dir, 'rnx', f'gps_rnx_daily_{year}{day_of_year}')
    if not os.path.isdir(rnx_folder):
        print(f"Directory {rnx_folder} does not exist.")
        return
    # Only report the folder as present after the check above has passed
    print(f"{rnx_folder} exists.")

    all_data = []

    for file_name in os.listdir(rnx_folder):
        if not file_name.endswith(".rnx"):
            continue

        file_path = os.path.join(rnx_folder, file_name)

        try:
            with open(file_path, 'rb') as f:
                raw = f.read()
                encoding = chardet.detect(raw)['encoding'] or 'utf-8'

            with open(file_path, 'r', encoding=encoding) as f:
                first_line = f.readline()
                # The RINEX version number occupies the first 9 columns of line 1
                is_version_3 = first_line[0:9].strip().startswith('3')
                if not is_version_3:
                    print(f"Skipping non-version 3 file: {file_name}")
                    continue

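                # Read to end of header; stop at EOF so a truncated file cannot loop forever
                header_line = f.readline()
                while header_line and 'END OF HEADER' not in header_line:
                    header_line = f.readline()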

                while True:
                    line = f.readline()
                    if not line:
                        break
                    if not line.startswith('G'):
                        continue

                    subs_lines = [f.readline() for _ in range(6)]
                    if not all(subs_lines):
                        continue  # Skip incomplete entry

                    try:
                        entry = {
                            'SV Name': line[0:3],
                            'Epoch Year': int(line[3:8]),
                            'Epoch Month': int(line[8:11]),
                            'Epoch Day': int(line[11:14]),
                            'Epoch Hour': int(line[14:17]),
                            'Epoch Minute': int(line[17:20]),
                            'Epoch Second': int(line[20:23]),
                            'Clock Bias': float(line[23:42].lower().replace('d', 'e')),
                            'Clock Drift': float(line[42:61].lower().replace('d', 'e')),
                            'Clock Drift Rate': float(line[61:80].lower().replace('d', 'e')),

                            'IODE': float(subs_lines[0][4:23].lower().replace('d', 'e')),
                            'Crs': float(subs_lines[0][23:42].lower().replace('d', 'e')),
                            'Delta n': float(subs_lines[0][42:61].lower().replace('d', 'e')),
                            'M0': float(subs_lines[0][61:80].lower().replace('d', 'e')),

                            'Cuc': float(subs_lines[1][4:23].lower().replace('d', 'e')),
                            'Eccentricity': float(subs_lines[1][23:42].lower().replace('d', 'e')),
                            'Cus': float(subs_lines[1][42:61].lower().replace('d', 'e')),
                            'sqrtA': float(subs_lines[1][61:80].lower().replace('d', 'e')),

                            'Toe': float(subs_lines[2][4:23].lower().replace('d', 'e')),
                            'Cic': float(subs_lines[2][23:42].lower().replace('d', 'e')),
                            'Omega0': float(subs_lines[2][42:61].lower().replace('d', 'e')),
                            'Cis': float(subs_lines[2][61:80].lower().replace('d', 'e')),

                            'Io': float(subs_lines[3][4:23].lower().replace('d', 'e')),
                            'Crc': float(subs_lines[3][23:42].lower().replace('d', 'e')),
                            'omega': float(subs_lines[3][42:61].lower().replace('d', 'e')),
                            'OmegaDot': float(subs_lines[3][61:80].lower().replace('d', 'e')),

                            'IDOT': float(subs_lines[4][4:23].lower().replace('d', 'e')),
                            'Codes on L2': float(subs_lines[4][23:42].lower().replace('d', 'e')),
                            'GPS Week': float(subs_lines[4][42:61].lower().replace('d', 'e')),
                            'L2 P flag': float(subs_lines[4][61:80].lower().replace('d', 'e')),

                            'SV Accuracy': float(subs_lines[5][4:23].lower().replace('d', 'e')),
                            'SV Health': float(subs_lines[5][23:42].lower().replace('d', 'e')),
                            'TGD': float(subs_lines[5][42:61].lower().replace('d', 'e')),
                            'IODC': float(subs_lines[5][61:80].lower().replace('d', 'e')),
                            'File': file_name
                        }
                        all_data.append(entry)

                    except Exception as parse_err:
                        print(f"Parse error in {file_name}: {parse_err}")
                        continue

        except Exception as read_err:
            print(f"Error reading {file_name}: {read_err}")

    if all_data:
        df = pd.DataFrame(all_data)
        output_csv = os.path.join(base_dir, f'gps_rnx_{year}{day_of_year}.csv')
        df.to_csv(output_csv, index=False)
        print(f"Saved {len(df)} entries to {output_csv}")
    else:
        print(f"No data parsed for day {day_of_year}.")

In [None]:
# Example run
for day in range(103, 114):  # days 103 to 113
    process_rnx_to_csv("2018", f"{day:03d}")
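
The parser above repeats the same `.lower().replace('d', 'e')` conversion for every field. A minimal sketch of how that could be factored out is shown below. The helper names `parse_rinex_float` and `parse_orbit_line` are hypothetical, and returning NaN for a blank field is an assumption rather than something the notebook does.

```python
import math


def parse_rinex_float(field):
    """Convert one fixed-width RINEX field (Fortran 'D' exponent) to float.

    Assumption: a blank field is returned as NaN instead of raising.
    """
    text = field.strip().lower().replace("d", "e")
    return float(text) if text else math.nan


def parse_orbit_line(line):
    """Slice a broadcast-orbit line into the same four 19-character fields
    used by process_rnx_to_csv (offsets 4, 23, 42 and 61)."""
    return [parse_rinex_float(line[start:start + 19]) for start in (4, 23, 42, 61)]


# Synthetic line built in the layout the parser slices; values are illustrative
values = [89.0, 42.5625, 4.851274e-09, 1.327367]
sample = "    " + "".join(f"{v:19.12E}".replace("E", "D") for v in values)
print(parse_orbit_line(sample))
```

With such a helper, each broadcast-orbit line could be parsed in one call instead of four near-identical expressions.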

In [None]:
import os
import pandas as pd

def process_clk_to_csv(week: int):
    base_dir = os.path.dirname(__file__) if "__file__" in globals() else os.getcwd()
    clk_folder = os.path.join(base_dir, "clk", f"gps_{week}")

    if not os.path.isdir(clk_folder):
        print(f"Directory {clk_folder} does not exist.")
        return

    all_data = []

    for file_name in os.listdir(clk_folder):
        if not file_name.lower().endswith(".clk"):
            continue

        file_path = os.path.join(clk_folder, file_name)

        try:
            with open(file_path, "r") as f:
                first_line = f.readline()
                second_line = f.readline()
                is_version_3 = "3.04" in first_line[0:21]
                end_header = False

                for line in f:
                    if not end_header:
                        if "END OF HEADER" in line:
                            end_header = True
                        continue

                    if not line.startswith("AS"):
                        continue

                    try:
                        # Version 3
                        if is_version_3 and "CNES" in second_line[21:42].upper() and line[3:13].startswith("G"):
                            row = {
                                "Clock Data Type": line[0:3],
                                "SV Name": line[3:13].strip(),
                                "Epoch Year": int(line[13:18]),
                                "Epoch Month": int(line[18:21]),
                                "Epoch Day": int(line[21:24]),
                                "Epoch Hour": int(line[24:27]),
                                "Epoch Minute": int(line[27:30]),
                                "Epoch Second": int(float(line[30:40])),
                                "Clock Bias (seconds)": float(line[45:66].lower().replace("d", "e")),
                                "Version": "3.04",
                                "File": file_name
                            }
                            all_data.append(row)

                        # Pre-3
                        elif not is_version_3 and "CNES" in second_line[20:40].upper() and line[3:8].startswith("G"):
                            row = {
                                "Clock Data Type": line[0:3],
                                "SV Name": line[3:8].strip(),
                                "Epoch Year": int(line[8:12]),
                                "Epoch Month": int(line[12:15]),
                                "Epoch Day": int(line[15:18]),
                                "Epoch Hour": int(line[18:21]),
                                "Epoch Minute": int(line[21:24]),
                                "Epoch Second": int(float(line[24:34])),
                                "Clock Bias (seconds)": float(line[40:59].lower().replace("d", "e")),
                                "Version": "Pre-3.04",
                                "File": file_name
                            }
                            all_data.append(row)

                    except Exception as parse_err:
                        print(f"Parse error in {file_name}: {parse_err}")
                        continue

        except Exception as read_err:
            print(f"Error reading {file_name}: {read_err}")

    if all_data:
        df = pd.DataFrame(all_data)
        df.sort_values(by=["Epoch Year", "Epoch Month", "Epoch Day", "Epoch Hour", "Epoch Minute", "Epoch Second"], inplace=True)
        output_csv = os.path.join(base_dir, f"gps_clk_week_{week}.csv")
        df.to_csv(output_csv, index=False)
        print(f"Saved {len(df)} entries to {output_csv}")
    else:
        print(f"No valid data found for GPS week {week}.")

In [None]:
# Example call
for week in range(1997, 1998):  # Just week 1997 for now
    process_clk_to_csv(week)
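
The two branches above differ only in their column offsets. As a hedged alternative, the sketch below splits an `AS` record on whitespace, which would work for either layout as long as adjacent fields never touch. That is an assumption, not something the clock format guarantees, and the function `parse_clk_record` and the sample record are made up for illustration.

```python
def parse_clk_record(line):
    """Parse one 'AS' clock record by whitespace.

    Assumed field order: type, name, year, month, day, hour, minute,
    second, number of values, clock bias (further fields are ignored).
    """
    parts = line.split()
    if len(parts) < 10 or parts[0] != "AS":
        return None
    year, month, day, hour, minute = (int(p) for p in parts[2:7])
    return {
        "SV Name": parts[1],
        "Epoch Year": year,
        "Epoch Month": month,
        "Epoch Day": day,
        "Epoch Hour": hour,
        "Epoch Minute": minute,
        "Epoch Second": int(float(parts[7])),
        "Clock Bias (seconds)": float(parts[9].lower().replace("d", "e")),
    }


# Illustrative record, not taken from a real file
sample = "AS G01  2018 04 15 00 00  0.000000  2   -0.410000000000E-04  0.100000000000E-10"
print(parse_clk_record(sample))
```

Fixed-column slicing, as in `process_clk_to_csv`, remains the safer choice when fields can run together.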

## Step 0B: Consolidate all RINEX files into single RNX info file

In [None]:
import pandas as pd
import glob
import os

folder_path = "."  # current folder
csv_files = glob.glob(os.path.join(folder_path, "gps_rnx_*.csv")) # glob is a python module to search for file path names that match a specific pattern
combined_df = pd.concat([pd.read_csv(f) for f in csv_files], ignore_index=True)
combined_df.to_csv("rnx_1997_raw.csv", index=False)

## Step 0C: Fix SV Naming Issue with RNX files 

The RNX files list SV names as `G 1`, whereas the CLK files list them as `G01`. Let's make sure both use the same format.

In [None]:
import pandas as pd 
rnx_df = pd.read_csv("rnx_1997_raw.csv")
rnx_df['SV Name'] = rnx_df['SV Name'].str.replace(r'^G (\d)$', r'G0\1', regex=True)
rnx_df.to_csv("rnx_1997_raw.csv")
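
The regular expression in the cell above rewrites only labels of the form `G` + space + one digit. A slightly more tolerant variant, sketched below with the standard `re` module, also accepts `G1` and already-padded labels. The function name `normalize_sv_name` is hypothetical.

```python
import re


def normalize_sv_name(name):
    """Return a zero-padded SV label such as 'G01' for 'G 1', 'G1' or 'G01'."""
    match = re.fullmatch(r"([A-Z])\s*(\d{1,2})", name.strip())
    if match is None:
        return name  # leave unrecognized labels untouched
    system, number = match.groups()
    return f"{system}{int(number):02d}"


for raw in ["G 1", "G1", "G01", "G12"]:
    print(repr(raw), "->", normalize_sv_name(raw))
```

On a DataFrame this could be applied with `rnx_df['SV Name'].map(normalize_sv_name)`.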

## Step 1: Match up and add the CLK bias from IGS to the RNX file to yield a combined dataset file

In [None]:
import pandas as pd 

rnx_df = pd.read_csv("rnx_1997_raw.csv")
clk_df = pd.read_csv("gps_clk_week_1997.csv")

merge_cols = ["SV Name", "Epoch Year", "Epoch Month", "Epoch Day", "Epoch Hour", "Epoch Minute", "Epoch Second"]

merged_df = pd.merge(rnx_df, 
                    clk_df[merge_cols + ["Clock Bias (seconds)"]], 
                    on = merge_cols, 
                    how = "left")

merged_df.to_csv("1_combined_raw_dataset.csv", index = False)

print(f"Merged file saved with shape: {merged_df.shape}")
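
A left merge silently leaves NaN where no CLK row matches. The sketch below, on made-up toy frames, shows one way to count those rows using the `indicator` and `validate` options of `DataFrame.merge`. The column subset and values are illustrative only.

```python
import pandas as pd

# Toy frames standing in for the RNX and CLK tables (values are made up)
rnx = pd.DataFrame({
    "SV Name": ["G01", "G02", "G03"],
    "Epoch Hour": [0, 0, 0],
    "Clock Bias": [1.0e-5, 2.0e-5, 3.0e-5],
})
clk = pd.DataFrame({
    "SV Name": ["G01", "G02"],
    "Epoch Hour": [0, 0],
    "Clock Bias (seconds)": [1.1e-5, 2.1e-5],
})

merged = rnx.merge(
    clk,
    on=["SV Name", "Epoch Hour"],
    how="left",
    indicator=True,          # adds a "_merge" column: "both" or "left_only"
    validate="many_to_one",  # raises if a CLK key appears more than once
)
unmatched = int((merged["_merge"] == "left_only").sum())
print(f"{unmatched} of {len(merged)} RNX rows have no CLK match")
```

If the real CLK table contains repeated keys, `validate="many_to_one"` would raise, which flags duplicates before they multiply rows in the merged output.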

## Step 2: Drop rows without a CLK bias (seconds) or with other missing entries, so that the dataset is complete

In [None]:
combined_df = pd.read_csv("1_combined_raw_dataset.csv")

complete_df = combined_df.dropna()  # drop rows with at least one NaN

complete_df.to_csv("2_complete_raw_dataset.csv", index = False)

print(f"Saved cleaned complete dataset and went from {combined_df.shape} to {complete_df.shape}")

In [None]:
epoch_cols = ['Epoch Hour', 'Epoch Minute', 'Epoch Second']

# Check if any value is not zero in each column
for col in epoch_cols:
    non_zero = complete_df[complete_df[col] != 0]
    if not non_zero.empty:
        print(f"Non-zero values found in '{col}':")
        print(non_zero[[col]])
    else:
        print(f"All values in '{col}' are zero.")


Every value in the 'Epoch Second' column is 0, while the minute, hour, day, month, and year columns all contain non-zero values.

## Step 3: Adding a datetime object column to track sample rate

In [None]:
complete_df = pd.read_csv("2_complete_raw_dataset.csv")

# create datetime col
complete_df["epoch"] = pd.to_datetime({
    "year": complete_df["Epoch Year"],
    "month": complete_df["Epoch Month"],
    "day": complete_df["Epoch Day"],
    "hour": complete_df["Epoch Hour"],
    "minute": complete_df["Epoch Minute"],
    "second": complete_df["Epoch Second"]
})

complete_df.to_csv("3_create_datetime_objects.csv")

In [None]:
complete_df

## Step 3A: Visualize Data

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

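# Parse 'epoch' as datetimes; otherwise the x-axis is treated as plain strings
complete_df = pd.read_csv("3_create_datetime_objects.csv", parse_dates=["epoch"])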

# Group by 'SV Name'
grouped = complete_df.groupby('SV Name')

# Create one plot per SV Name
for sv_name, group in grouped:
    # Sort by epoch to ensure proper plotting
    group = group.sort_values('epoch')

    plt.figure(figsize=(15, 5))
    
    # Line plot for continuity
    plt.plot(group['epoch'], group['Clock Bias (seconds)'], label='Line Plot')
    
    # Scatter plot for actual data points
    plt.scatter(group['epoch'], group['Clock Bias (seconds)'], color='red', s=10, label='Data Points')
    
    plt.title(f'Clock Bias Over Time - {sv_name}')
    plt.xlabel('Datetime')
    plt.ylabel('Clock Bias (seconds)')
    plt.xticks(rotation=90)
    plt.grid(True)
    plt.tight_layout()
    plt.legend()
    plt.show()


## Step 4: Remove unnecessary columns

In [None]:
datetime_df = pd.read_csv("3_create_datetime_objects.csv")

In [None]:
datetime_df['epoch'] = pd.to_datetime(datetime_df['epoch'])

In [None]:
cols_to_drop = [
    'Unnamed: 0.1', 'Unnamed: 0',
    'Epoch Year', 'Epoch Month', 'Epoch Day',
    'Epoch Hour', 'Epoch Minute', 'Epoch Second'
]

In [None]:
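# errors='ignore' keeps this cell rerunnable if an 'Unnamed' index column is absent
datetime_df = datetime_df.drop(columns=cols_to_drop, errors='ignore')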

In [None]:
duplicates = datetime_df[datetime_df.duplicated(subset=['SV Name', 'epoch'], keep=False)]

In [None]:
to_delete = duplicates[duplicates['SV Health'] != 0]

In [None]:
cleaned_df = datetime_df.drop(index=to_delete.index)

In [None]:
still_duplicated = cleaned_df.duplicated(subset=['SV Name', 'epoch'], keep='first')

In [None]:
final_df = cleaned_df[~still_duplicated]

In [None]:
to_delete.to_csv("deleted_rows_due_to_sv_health.csv", index=False)

In [None]:
final_df = final_df.set_index('epoch')

In [None]:
final_df.to_csv('4_remove_unnecessary_cols_and_repeats.csv')

In [None]:
final_df

## Step 5: Make dataset a regular time series by interpolating missing values per SV group

In [None]:
pruned_df = pd.read_csv("4_remove_unnecessary_cols_and_repeats.csv")

In [None]:
pruned_df['SV Name'] = pruned_df['SV Name'].str.replace('G', '', regex=False).astype(int)

In [None]:
pruned_df

In [None]:
pruned_df['epoch'] = pd.to_datetime(pruned_df['epoch'])
pruned_df = pruned_df.set_index('epoch')

In [None]:
pruned_df = pruned_df.sort_values(['SV Name', 'epoch'])
pruned_df

In [None]:
resampled_df = (
    pruned_df
    .groupby('SV Name')
    .resample('1min')  # 1-minute frequency (the 'T' alias is deprecated in recent pandas)
    .asfreq()
    .interpolate(method='linear', limit_area='inside')
)

In [None]:
resampled_df

In [None]:
resampled_df.to_csv("5_resampled_1min_interval.csv")
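
To make the effect of this step concrete, here is a small sketch on made-up data. It resamples each SV separately onto a 1-minute grid and fills the gaps linearly. The column names `sv` and `bias` and the helper `regularize` are hypothetical, and the per-group loop is a swapped-in variant of the chained `groupby(...).resample(...)` call above, chosen so that interpolation can never cross SV boundaries.

```python
import pandas as pd

# Two hypothetical SVs with irregular samples (values are made up)
toy = pd.DataFrame({
    "sv": [1, 1, 2, 2],
    "epoch": pd.to_datetime([
        "2018-04-15 00:00", "2018-04-15 00:04",
        "2018-04-15 00:00", "2018-04-15 00:02",
    ]),
    "bias": [0.0, 4.0, 10.0, 12.0],
})


def regularize(group):
    """Put one SV's samples on a 1-minute grid and interpolate interior gaps."""
    series = group.set_index("epoch")["bias"]
    return series.resample("1min").asfreq().interpolate(
        method="linear", limit_area="inside"
    )


parts = {sv: regularize(group) for sv, group in toy.groupby("sv")}
regular = pd.concat(parts, names=["sv", "epoch"])
print(regular)
```

Each SV keeps its own start and end time; only the minutes between existing samples are filled.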

## Step 6: Normalize Continuous Columns

In [None]:
resampled_df = pd.read_csv("5_resampled_1min_interval.csv")

In [None]:
resampled_df.T

In [None]:
import pandas as pd

In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
cols_to_scale = resampled_df.columns.difference(['epoch', 'SV Name', 'File'])

In [None]:
scaler = MinMaxScaler()

In [None]:
resampled_df[cols_to_scale] = scaler.fit_transform(resampled_df[cols_to_scale])

In [None]:
print(resampled_df.head())

In [None]:
resampled_df.to_csv("6_scaled_dataset.csv")
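
`MinMaxScaler` maps each column to [0, 1] through x' = (x - min) / (max - min). The plain-Python sketch below mirrors that formula and its inverse, which is what is needed to turn a scaled prediction back into seconds. The function names are hypothetical, and mapping a constant column to 0.0 is an assumption made here to avoid dividing by zero.

```python
def fit_min_max(values):
    """Return the (min, max) pair that defines the scaling of one column."""
    return min(values), max(values)


def scale(values, lo, hi):
    """Apply x' = (x - lo) / (hi - lo) to every value."""
    span = hi - lo
    # A constant column has zero span; map it to 0.0 instead of dividing by zero
    return [0.0 if span == 0 else (v - lo) / span for v in values]


def unscale(scaled, lo, hi):
    """Invert the scaling: x = x' * (hi - lo) + lo."""
    return [s * (hi - lo) + lo for s in scaled]


column = [2.0, 4.0, 6.0]  # made-up values
lo, hi = fit_min_max(column)
scaled = scale(column, lo, hi)
print(scaled, unscale(scaled, lo, hi))
```

Keeping `lo` and `hi` (or the fitted scaler object) is what makes the inverse step possible later.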

## Step 7: Normalizing Time Idx

In [None]:
scaled_df = pd.read_csv("6_scaled_dataset.csv")

In [2]:
# using raw dataset without scaling 
import pandas as pd
scaled_df = pd.read_csv("5_resampled_1min_interval.csv")

In [3]:
scaled_df

Unnamed: 0,SV Name,epoch,SV Name.1,Clock Bias,Clock Drift,Clock Drift Rate,IODE,Crs,Delta n,M0,...,IDOT,Codes on L2,GPS Week,L2 P flag,SV Accuracy,SV Health,TGD,IODC,File,Clock Bias (seconds)
0,1,2018-04-15 00:00:00,1.0,-0.000041,-2.614797e-12,0.0,89.000000,42.562500,4.851274e-09,1.327367,...,1.678641e-10,1.0,1997.0,0.0,3.400,0.0,5.587935e-09,89.000000,KITG00UZB_R_20181040000_01D_GN.rnx,-0.000041
1,1,2018-04-15 00:01:00,1.0,-0.000041,-2.614797e-12,0.0,89.083333,42.645573,4.851029e-09,1.336121,...,1.692809e-10,1.0,1997.0,0.0,3.395,0.0,5.587935e-09,89.083333,,-0.000041
2,1,2018-04-15 00:02:00,1.0,-0.000041,-2.614797e-12,0.0,89.166667,42.728646,4.850785e-09,1.344874,...,1.706976e-10,1.0,1997.0,0.0,3.390,0.0,5.587935e-09,89.166667,,-0.000041
3,1,2018-04-15 00:03:00,1.0,-0.000041,-2.614797e-12,0.0,89.250000,42.811719,4.850541e-09,1.353627,...,1.721143e-10,1.0,1997.0,0.0,3.385,0.0,5.587935e-09,89.250000,,-0.000041
4,1,2018-04-15 00:04:00,1.0,-0.000041,-2.614797e-12,0.0,89.333333,42.894792,4.850297e-09,1.362380,...,1.735310e-10,1.0,1997.0,0.0,3.380,0.0,5.587935e-09,89.333333,,-0.000041
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
308186,32,2018-04-21 21:56:00,32.0,-0.000518,3.979039e-12,0.0,81.933333,5.603125,4.653539e-09,-2.668432,...,-1.017304e-10,1.0,1997.0,0.0,2.000,0.0,4.656613e-10,81.933333,,-0.000518
308187,32,2018-04-21 21:57:00,32.0,-0.000518,3.979039e-12,0.0,81.950000,5.608594,4.652971e-09,-2.712040,...,-1.017453e-10,1.0,1997.0,0.0,2.000,0.0,4.656613e-10,81.950000,,-0.000518
308188,32,2018-04-21 21:58:00,32.0,-0.000518,3.979039e-12,0.0,81.966667,5.614063,4.652402e-09,-2.755649,...,-1.017602e-10,1.0,1997.0,0.0,2.000,0.0,4.656613e-10,81.966667,,-0.000518
308189,32,2018-04-21 21:59:00,32.0,-0.000518,3.979039e-12,0.0,81.983333,5.619531,4.651834e-09,-2.799257,...,-1.017751e-10,1.0,1997.0,0.0,2.000,0.0,4.656613e-10,81.983333,,-0.000518


In [None]:
# Fix datetime column
scaled_df["epoch"] = pd.to_datetime(scaled_df["epoch"])

# Create time index in units of 1-minute intervals since global start
scaled_df["time_idx"] = ((scaled_df["epoch"] - scaled_df["epoch"].min()).dt.total_seconds() / 60).astype(int)

In [None]:
scaled_df.to_csv("7_time_idx_added.csv")
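
The time index above counts whole minutes from the earliest epoch in the dataset, so all SVs share one clock. A standard-library sketch of the same arithmetic, with a hypothetical helper name and three sample epochs from the week in question:

```python
from datetime import datetime


def minutes_since_start(epochs):
    """Return whole minutes elapsed since the earliest epoch in the list."""
    start = min(epochs)
    return [int((epoch - start).total_seconds() // 60) for epoch in epochs]


epochs = [
    datetime(2018, 4, 15, 0, 0),
    datetime(2018, 4, 15, 0, 1),
    datetime(2018, 4, 21, 21, 59),
]
print(minutes_since_start(epochs))  # [0, 1, 9959]
```

The last value follows from 6 days, 21 hours and 59 minutes: 6 * 1440 + 21 * 60 + 59 = 9959.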

In [None]:
scaled_df

## Delete Weird Rows 

In [2]:
#  using unscaled dataset 
import pandas as pd
weird_rows_delete = pd.read_csv("5_resampled_1min_interval.csv")
weird_rows_delete

Unnamed: 0,SV Name,epoch,SV Name.1,Clock Bias,Clock Drift,Clock Drift Rate,IODE,Crs,Delta n,M0,...,IDOT,Codes on L2,GPS Week,L2 P flag,SV Accuracy,SV Health,TGD,IODC,File,Clock Bias (seconds)
0,1,2018-04-15 00:00:00,1.0,-0.000041,-2.614797e-12,0.0,89.000000,42.562500,4.851274e-09,1.327367,...,1.678641e-10,1.0,1997.0,0.0,3.400,0.0,5.587935e-09,89.000000,KITG00UZB_R_20181040000_01D_GN.rnx,-0.000041
1,1,2018-04-15 00:01:00,1.0,-0.000041,-2.614797e-12,0.0,89.083333,42.645573,4.851029e-09,1.336121,...,1.692809e-10,1.0,1997.0,0.0,3.395,0.0,5.587935e-09,89.083333,,-0.000041
2,1,2018-04-15 00:02:00,1.0,-0.000041,-2.614797e-12,0.0,89.166667,42.728646,4.850785e-09,1.344874,...,1.706976e-10,1.0,1997.0,0.0,3.390,0.0,5.587935e-09,89.166667,,-0.000041
3,1,2018-04-15 00:03:00,1.0,-0.000041,-2.614797e-12,0.0,89.250000,42.811719,4.850541e-09,1.353627,...,1.721143e-10,1.0,1997.0,0.0,3.385,0.0,5.587935e-09,89.250000,,-0.000041
4,1,2018-04-15 00:04:00,1.0,-0.000041,-2.614797e-12,0.0,89.333333,42.894792,4.850297e-09,1.362380,...,1.735310e-10,1.0,1997.0,0.0,3.380,0.0,5.587935e-09,89.333333,,-0.000041
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
308186,32,2018-04-21 21:56:00,32.0,-0.000518,3.979039e-12,0.0,81.933333,5.603125,4.653539e-09,-2.668432,...,-1.017304e-10,1.0,1997.0,0.0,2.000,0.0,4.656613e-10,81.933333,,-0.000518
308187,32,2018-04-21 21:57:00,32.0,-0.000518,3.979039e-12,0.0,81.950000,5.608594,4.652971e-09,-2.712040,...,-1.017453e-10,1.0,1997.0,0.0,2.000,0.0,4.656613e-10,81.950000,,-0.000518
308188,32,2018-04-21 21:58:00,32.0,-0.000518,3.979039e-12,0.0,81.966667,5.614063,4.652402e-09,-2.755649,...,-1.017602e-10,1.0,1997.0,0.0,2.000,0.0,4.656613e-10,81.966667,,-0.000518
308189,32,2018-04-21 21:59:00,32.0,-0.000518,3.979039e-12,0.0,81.983333,5.619531,4.651834e-09,-2.799257,...,-1.017751e-10,1.0,1997.0,0.0,2.000,0.0,4.656613e-10,81.983333,,-0.000518


In [None]:
weird_rows_delete = pd.read_csv("7_time_idx_added.csv")

In [None]:
weird_rows_delete

In [None]:
# using unscaled dataset 
cols_to_drop = ['Unnamed: 0.1', 'Unnamed: 0', 'SV Name.1']

In [None]:
weird_rows_delete = weird_rows_delete.drop(columns=cols_to_drop, errors='ignore')

In [None]:
weird_rows_delete.to_csv("8_no_bad_rows.csv", index=False)

## EXTRA

In [None]:
full = pd.read_csv("6_scaled_dataset.csv")

In [None]:
# Get sorted unique IODE values
unique_iode_sorted = sorted(df["IODE"].dropna().unique())  # dropna in case of any missing values

# Build a mapping like {12: 1, 18: 2, 29: 3, ...}
iode_mapping = {val: i + 1 for i, val in enumerate(unique_iode_sorted)}

# Apply the mapping
df["IODE Encoded"] = df["IODE"].map(iode_mapping)

# Preview mapping
print("IODE → Encoded Value (sorted):")
for val in list(iode_mapping)[:1000]:
    print(f"{val} → {iode_mapping[val]}")

In [None]:
# Get sorted unique Codes on L2 values
unique_l2_codes_sorted = sorted(df["Codes on L2"].dropna().unique())

# Build mapping like {1: 1, 2: 2, 5: 3, ...}
l2_code_mapping = {val: i + 1 for i, val in enumerate(unique_l2_codes_sorted)}

# Apply the mapping
df["Codes on L2 Encoded"] = df["Codes on L2"].map(l2_code_mapping)

# Preview mapping
print("Codes on L2 → Encoded Value (sorted):")
for val in list(l2_code_mapping)[:10]:
    print(f"{val} → {l2_code_mapping[val]}")


In [None]:
# Get sorted unique SV Health values
unique_sv_health_sorted = sorted(df["SV Health"].dropna().unique())

# Build mapping like {0: 1, 1: 2, 2: 3, ...}
sv_health_mapping = {val: i + 1 for i, val in enumerate(unique_sv_health_sorted)}

# Apply the mapping
df["SV Health Encoded"] = df["SV Health"].map(sv_health_mapping)

# Preview mapping
print("SV Health → Encoded Value (sorted):")
for val in list(sv_health_mapping)[:10]:
    print(f"{val} → {sv_health_mapping[val]}")


In [None]:
# Get sorted unique IODC values
unique_iodc_sorted = sorted(df["IODC"].dropna().unique())

# Build mapping like {1: 1, 5: 2, 6: 3, ...}
iodc_mapping = {val: i + 1 for i, val in enumerate(unique_iodc_sorted)}

# Apply the mapping
df["IODC Encoded"] = df["IODC"].map(iodc_mapping)

# Preview mapping
print("IODC → Encoded Value (sorted):")
for val in list(iodc_mapping)[:1000]:
    print(f"{val} → {iodc_mapping[val]}")
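
The four encoding cells above follow one pattern: sort the distinct values, then number them from 1. A generic sketch of that pattern in plain Python is given below. The function name `ordinal_mapping` and the sample values are hypothetical.

```python
def ordinal_mapping(values):
    """Map each distinct non-missing value to its 1-based rank in sorted order."""
    # `v == v` is False only for NaN, so this filter drops None and NaN
    distinct = sorted({v for v in values if v is not None and v == v})
    return {value: rank for rank, value in enumerate(distinct, start=1)}


iode = [29.0, 12.0, 18.0, 12.0, float("nan")]  # made-up values
mapping = ordinal_mapping(iode)
encoded = [mapping.get(v) for v in iode]
print(mapping)
print(encoded)
```

Missing values stay unencoded (`None` here), matching the `dropna()` in the cells above.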

In [None]:
df

In [None]:
# Keep only rows where Epoch Minute == 0
df = df[df["Epoch Minute"] == 0]

# Reset index if needed
df = df.reset_index(drop=True)

# Confirm it's working
print("Unique values in 'Epoch Minute':", df["Epoch Minute"].unique())


In [None]:
df

In [None]:
sv_counts = df["SV Name"].value_counts()

print("Number of data points per SV Name:")
print(sv_counts)


In [None]:
from sklearn.preprocessing import MinMaxScaler

# Define your continuous columns
continuous_cols = [col for col in df.columns 
                   if col not in ["epoch", "SV Name", "IODE", "Codes on L2", 
                                  "SV Health", "IODC", 
                                  "Epoch Year", "Epoch Month", "Epoch Day", 
                                  "Epoch Hour", "Epoch Minute", "Epoch Second", "File"]]

# Apply Min-Max scaling to each column independently
for col in continuous_cols:
    scaler = MinMaxScaler()
    df[col] = scaler.fit_transform(df[[col]])


In [None]:
df

In [None]:
output_path = "complete_dataset_scaled.csv"
df.to_csv(output_path, index=False)
print(f"Saved to {output_path}")

# Notes: 

The previous part of this project investigated whether the broadcast clock bias could be used to forecast a correction that yields a more accurate clock bias. The IGS post-processed products served as the ground-truth clock bias, and the difference between them and the broadcast clock bias was used as the model's label. The first model was an encoder-only transformer; the second was a Temporal Fusion Transformer (designed specifically for multivariate time-series forecasting). While both models achieved low RMSE, their R² values were not consistent with a well-performing model (i.e., the model was likely outputting a bias that was close enough without truly predicting a specific behavior). This may be because the clock bias "correction" data is largely a random walk process (perhaps with an underlying bias).

The Augmented Dickey-Fuller test gave results consistent with the data not being a random walk, even after the data's bias was subtracted, but other analyses indicated that it likely is one. These included Fourier analysis (outcome: peaked at 0) and data visualization (which showed an almost perfectly Gaussian distribution, even though formal tests for Gaussianity came back negative). Taken together, this review leads to the conclusion that the data labels are largely noise and cannot be modeled.
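
Low RMSE alongside an unconvincing R² is not contradictory. RMSE is measured in the label's units, so tiny labels give tiny RMSE, whereas R² compares the model against simply predicting the mean. The sketch below uses made-up nanosecond-scale labels, not the project's data, to show a constant-mean predictor with a very small RMSE and an R² of zero.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error, in the units of the labels."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up labels on the order of nanoseconds, for illustration only
labels = [1.0e-9, -2.0e-9, 0.5e-9, -1.5e-9, 2.0e-9]
mean_label = sum(labels) / len(labels)
constant_prediction = [mean_label] * len(labels)

print("RMSE:", rmse(labels, constant_prediction))
print("R^2 :", r_squared(labels, constant_prediction))
```

A model that only learns the mean of the labels scores well on RMSE here while explaining none of the variance.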

This prompts a change in direction for the project: 

My thesis statement is the following: I aim to build low-infrastructure ground systems for use on the Moon, leveraging novel technology which has demonstrated that distant GNSS signals can be received from the Moon. This means that to operate a Moon rover, we may be able to combine these sparse GNSS signals with machine learning to achieve accurate timing on the Moon.

How this will be done: the model will use ephemeris data as inputs and IGS post-processed clock bias products as labels. The objective is to model the final clock bias during periods when only limited inputs are known.

This document will demonstrate whether this is possible with a minimum viable product that uses one week of data. The data is expected to be split 80/20 for training/validation.
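
One way the 80/20 split could be done is chronologically, using the `time_idx` column so that validation data always comes after training data. This ordering is an assumption, since the text above does not say whether the split is chronological or random. The helper `chronological_cutoff` and the toy indices are hypothetical.

```python
def chronological_cutoff(time_indices, train_fraction=0.8):
    """Return the time index at which validation starts.

    Rows with time_idx < cutoff go to training, the rest to validation.
    """
    ordered = sorted(set(time_indices))
    return ordered[int(len(ordered) * train_fraction)]


# Three hypothetical SVs sharing ten time steps
time_idx = list(range(10)) * 3
cutoff = chronological_cutoff(time_idx)
train = [t for t in time_idx if t < cutoff]
valid = [t for t in time_idx if t >= cutoff]
print(cutoff, len(train), len(valid))
```

Splitting on the shared time index keeps every SV's validation window aligned to the same final stretch of the week.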