# Pendulum cleaning pipeline (from Data Wrangler)

This notebook reproduces the GUI cleaning steps from Data Wrangler, adds a few domain-specific columns, and writes a clean dataset to `data/processed/pendulum_clean.csv`.

**Inputs**: `data/raw/pendulum_demo.csv` (the file the load cell below actually reads)  
**Outputs**: `data/processed/pendulum_clean.csv`


In [None]:
import numpy as np
import pandas as pd


pd.set_option("display.max_columns", 100)
pd.set_option("display.width", 120)

## 1) Data Wrangler export (GUI steps)

The next cell was inserted by Data Wrangler. It reads the raw CSV and applies the point-and-click transforms (replace units, change types, filters, etc.) to produce `df_clean`.
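As a rough illustration of what the "replace units" and "change types" steps do, here is a minimal sketch on a made-up Series (the sample strings are assumptions, not rows from the raw file). It uses the same character-class regex as the exported cell, then swaps decimal commas for dots. One deliberate difference: the exported cell converts with `astype("float64")`, which raises if a value strips down to an empty string, so this sketch swaps in `pd.to_numeric(..., errors="coerce")` to turn such values into `NaN` instead.

```python
import pandas as pd

# Made-up messy readings; "n/a" contains no numeric characters at all.
raw = pd.Series(["12,5 g", "  13.0g", "-1,25 cm", "n/a"])

# Step 1: drop every character that is not a digit, dot, comma, or minus sign.
stripped = raw.str.replace(r"[^0-9\.\,\-]", "", regex=True)

# Step 2: normalize decimal commas to decimal points.
decimal = stripped.str.replace(",", ".", regex=False)

# Step 3: convert to numbers; unparseable values (here the empty string) become NaN.
values = pd.to_numeric(decimal, errors="coerce")

print(values)
```

Rows that end up as `NaN` this way can then be dropped with the same `isna()` filter the exported cell applies to `mass_g`.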


In [None]:
import pandas as pd


def clean_data(df):
    # Change column type to datetime64[ns] for column: 'timestamp'
    df = df.astype({"timestamp": "datetime64[ns]"})
    # Replace all instances of "[^0-9\\.\\,\\-]" with "" in columns: 'mass_g', 'length_cm', 'time_s'
    df["mass_g"] = df["mass_g"].str.replace("[^0-9\\.\\,\\-]", "", case=False, regex=True)
    df["length_cm"] = df["length_cm"].str.replace("[^0-9\\.\\,\\-]", "", case=False, regex=True)
    df["time_s"] = df["time_s"].str.replace("[^0-9\\.\\,\\-]", "", case=False, regex=True)
    # Replace all instances of "," with "." in columns: 'mass_g', 'time_s', 'length_cm'
    df["mass_g"] = df["mass_g"].str.replace(",", ".", case=False, regex=False)
    df["time_s"] = df["time_s"].str.replace(",", ".", case=False, regex=False)
    df["length_cm"] = df["length_cm"].str.replace(",", ".", case=False, regex=False)
    # Change column type to float64 for columns: 'mass_g', 'length_cm', 'time_s'
    df = df.astype({"mass_g": "float64", "length_cm": "float64", "time_s": "float64"})
    # Replace all instances of "," with "." in column: 'voltage_V'
    df["voltage_V"] = df["voltage_V"].str.replace(",", ".", case=False, regex=False)
    # Filter rows based on column: 'voltage_V'
    df = df[
        ~(
            (df["voltage_V"].str.contains("err", regex=False, na=False, case=False))
            | (df["voltage_V"].isna())
        )
    ]
    # Change column type to float64 for column: 'voltage_V'
    df = df.astype({"voltage_V": "float64"})
    # Filter rows based on column: 'mass_g'
    df = df[~(df["mass_g"].isna())]
    return df


# Load the raw CSV. Data Wrangler recorded an absolute, machine-specific path here;
# a path relative to the notebook (same layout as out_path below) keeps the cell portable.
df = pd.read_csv("../data/raw/pendulum_demo.csv")

df_clean = clean_data(df.copy())
df_clean.head()

## 3) Derived columns

Add SI-unit columns and an estimate of g from the simple-pendulum relation, treating `time_s` as the period of one oscillation:
- `mass_kg = mass_g / 1000`
- `length_m = length_cm / 100`
- `g_est = 4π²·length_m / time_s²`
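
The `g_est` expression follows from the small-angle period of a simple pendulum. The short derivation below assumes that `time_s` holds the period $T$ of one full oscillation and `length_m` the pendulum length $L$; the notebook does not state this explicitly, but the formula implies it.

$$
T = 2\pi\sqrt{\frac{L}{g}}
\quad\Longrightarrow\quad
T^{2} = \frac{4\pi^{2} L}{g}
\quad\Longrightarrow\quad
g = \frac{4\pi^{2} L}{T^{2}}
$$

If `time_s` instead records the time for several swings, it would need to be divided by the number of swings before applying the formula.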


In [None]:
df_clean["mass_kg"] = df_clean["mass_g"] / 1000.0
df_clean["length_m"] = df_clean["length_cm"] / 100.0
df_clean["g_est"] = 4 * np.pi**2 * df_clean["length_m"] / (df_clean["time_s"] ** 2)

df_clean.head()

## 4) Write processed dataset

Materialize the cleaned dataset so other notebooks/apps can consume it.
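
One thing downstream consumers may want to keep in mind: CSV does not store dtypes, so the `timestamp` column comes back as text unless it is parsed on read. The sketch below shows the round trip on a hypothetical two-row stand-in for `df_clean` (the values and the temporary file location are made up for illustration).

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for df_clean; the numbers are illustrative only.
demo = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(["2024-01-01 10:00:00", "2024-01-01 10:05:00"]),
        "length_m": [0.50, 1.00],
        "time_s": [1.42, 2.01],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "pendulum_clean.csv"
    demo.to_csv(path, index=False)

    # Without parse_dates the timestamp column is read back as plain text.
    plain = pd.read_csv(path)
    # With parse_dates the datetime dtype is restored.
    parsed = pd.read_csv(path, parse_dates=["timestamp"])

print(plain["timestamp"].dtype, parsed["timestamp"].dtype)
```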


In [None]:
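from pathlib import Path

out_path = Path("../data/processed/pendulum_clean.csv")
out_path.parent.mkdir(parents=True, exist_ok=True)  # to_csv does not create missing folders
df_clean.to_csv(out_path, index=False)
out_path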

## 5) Quick visuals

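An interactive Plotly scatter of `g_est` over time, followed by a publication-style Matplotlib figure of `g_est` against pendulum length that is saved to `../reports/figures/g_vs_length.png`.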


In [None]:
import plotly.express as px


fig = px.scatter(
    df_clean,
    x="timestamp",
    y="g_est",
    color="sample_id",
    trendline="ols",  # requires the statsmodels package to be installed
    title="Estimated g over time by sample",
)
fig

In [None]:
from pathlib import Path

import matplotlib.pyplot as plt


plt.figure()
y = df_clean["g_est"].to_numpy()
yerr = np.maximum(0.01 * y, 1e-6)  # placeholder 1% error bars (replace with real σ if you have it)
plt.errorbar(df_clean["length_m"], y, yerr=yerr, fmt="o", capsize=3)
plt.xlabel("Length (m)")
plt.ylabel("Estimated g (m/s²)")
plt.grid(True)
plt.tight_layout()
fig_path = Path("../reports/figures/g_vs_length.png")
fig_path.parent.mkdir(parents=True, exist_ok=True)  # savefig does not create missing folders
plt.savefig(fig_path, dpi=300)
plt.show()
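
The error bars in the figure are a placeholder. If measurement uncertainties for length and period are available, first-order error propagation on g = 4π²L/T² gives σ_g = g·sqrt((σ_L/L)² + (2σ_T/T)²). The sketch below is one way to compute that. The helper name `g_with_uncertainty` and the σ values (1 mm, 0.01 s) are hypothetical and should be replaced with the real instrument uncertainties.

```python
import numpy as np


def g_with_uncertainty(length_m, time_s, sigma_length_m, sigma_time_s):
    """Return (g, sigma_g) using first-order propagation for g = 4*pi^2*L/T^2."""
    length_m = np.asarray(length_m, dtype=float)
    time_s = np.asarray(time_s, dtype=float)
    g = 4 * np.pi**2 * length_m / time_s**2
    # Relative uncertainties add in quadrature; the period enters squared,
    # so its relative uncertainty counts twice.
    rel = np.sqrt((sigma_length_m / length_m) ** 2 + (2 * sigma_time_s / time_s) ** 2)
    return g, g * rel


# Illustrative inputs only (not taken from the dataset).
g, sigma_g = g_with_uncertainty([0.50, 1.00], [1.42, 2.01], 0.001, 0.01)
print(g, sigma_g)
```

With real uncertainties, `sigma_g` could be passed as `yerr` to `plt.errorbar` in place of the 1% placeholder.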