# 01 — Cleaning & EDA (diagnostic capacity tests)

**Topic/Goal**: Supervised **regression** to predict remaining useful life (RUL), counted as the number of diagnostic steps remaining until the 80% end-of-life (EOL) threshold.  
**Data**: Onori EV capacity tests; **101 diagnostics / 10 cells**.  
**This notebook**: loads the derived CSV, re-derives features if needed, and produces the EDA figures.

**Rubric hooks**: Project topic/goal, data description, **cleaning & EDA** (figures + commentary).
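
The RUL target described above can be sketched as follows. This is a minimal illustration, not the pipeline that produced the CSV: the helper `derive_fade_and_rul` is hypothetical, and it assumes that each cell's first diagnostic is the reference capacity, that `fade_frac` is the fractional loss relative to that reference, that `diag` increases by one per diagnostic, and that cells which never cross the threshold get a missing RUL.

```python
import pandas as pd

def derive_fade_and_rul(df, eol_frac=0.80):
    """Hypothetical re-derivation of fade_frac and RUL from capacity_ah.

    Assumptions (not confirmed by the source data):
      - the first diagnostic of each cell is the reference capacity
      - EOL is the first diagnostic at or below eol_frac * reference
      - RUL is NaN for cells that never reach EOL, and 0 after EOL
    """
    out = df.sort_values(['cell_id', 'diag']).copy()
    ref = out.groupby('cell_id')['capacity_ah'].transform('first')
    out['fade_frac'] = 1.0 - out['capacity_ah'] / ref

    # Diagnostic index of the first below-threshold test, broadcast per cell
    below = out['capacity_ah'] <= eol_frac * ref
    eol_diag = out['diag'].where(below).groupby(out['cell_id']).transform('min')

    out['RUL'] = (eol_diag - out['diag']).clip(lower=0)
    return out

# Toy data (made up for illustration): cell A crosses 80%, cell B does not
toy = pd.DataFrame({
    'cell_id':     ['A', 'A', 'A', 'A', 'B', 'B'],
    'diag':        [0, 1, 2, 3, 0, 1],
    'capacity_ah': [5.0, 4.5, 3.9, 3.8, 5.0, 4.9],
})
print(derive_fade_and_rul(toy))
```

Leaving RUL missing for cells that never reach EOL keeps censored cells visible instead of silently assigning them a label; whether the labeled CSV handles them this way is not stated here.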

In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams['figure.dpi'] = 120
CSV = 'results/rpt_features_labeled_enriched.csv'
df  = pd.read_csv(CSV).sort_values(['cell_id','diag'])
print('rows:', len(df), '| cells:', df['cell_id'].nunique())
df.head()
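
Before plotting, a few integrity checks on the loaded table can confirm it matches the data description (101 diagnostics across 10 cells). The sketch below is one possible set of checks, not part of the original notebook; `check_diagnostics` is a hypothetical helper, and it assumes `(cell_id, diag)` should uniquely identify a row.

```python
import pandas as pd

def check_diagnostics(df, key=('cell_id', 'diag')):
    """Return a small dict of integrity checks for the per-diagnostic table."""
    key = list(key)
    missing = df.isna().sum()
    return {
        'n_rows': len(df),
        'n_cells': df['cell_id'].nunique(),
        # Rows that repeat an earlier (cell_id, diag) pair
        'n_duplicate_keys': int(df.duplicated(subset=key).sum()),
        # Only columns that actually contain missing values
        'n_missing': missing[missing > 0].to_dict(),
        'rows_per_cell': df.groupby('cell_id').size().to_dict(),
    }

# Toy data (made up): one duplicated key and one missing capacity
toy = pd.DataFrame({
    'cell_id':     ['A', 'A', 'B'],
    'diag':        [0, 0, 1],
    'capacity_ah': [5.0, None, 4.8],
})
print(check_diagnostics(toy))
```

On the real data, calling `check_diagnostics(df)` should report `n_rows` of 101 and `n_cells` of 10 if the CSV matches the description at the top of the notebook.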

### EDA — RUL distribution

In [None]:
OUT='results/figs'; os.makedirs(OUT, exist_ok=True)
ax = df['RUL'].hist(bins=20); ax.set_title('RUL (diagnostic steps remaining)')
ax.set_xlabel('RUL'); ax.set_ylabel('count'); plt.tight_layout()
plt.savefig(f'{OUT}/rul_hist.png'); plt.show()

### EDA — Capacity & Fade vs Diagnostic Index (by cell)

In [None]:
# Each tuple: (column to plot, figure title, output filename)
for col, title, fname in [('capacity_ah','Capacity vs diagnostic index (by cell)','cap_vs_diag_by_cell.png'),
                          ('fade_frac','Fade fraction vs diagnostic index (by cell)','fade_vs_diag.png')]:
    plt.figure()  # fresh figure per plot so curves never carry over between iterations
    for cid in df['cell_id'].unique():
        g = df[df['cell_id']==cid].sort_values('diag')
        plt.plot(g['diag'], g[col], marker='o', alpha=0.8, label=cid)
    plt.title(title); plt.xlabel('diagnostic index'); plt.ylabel(col); plt.legend(ncol=2, fontsize=8)
    plt.tight_layout(); plt.savefig(f'{OUT}/{fname}'); plt.show()
print('Saved figures to', OUT)
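
As a numeric companion to the capacity curves, the sketch below lists diagnostics where measured capacity rises relative to the previous test of the same cell. Such rises may reflect measurement noise or capacity recovery and are worth noting in the EDA commentary. The helper `capacity_increases` and its `tol` parameter are hypothetical, not part of the original notebook.

```python
import pandas as pd

def capacity_increases(df, tol=0.0):
    """Rows where capacity_ah rises by more than tol (Ah) versus the
    previous diagnostic of the same cell."""
    out = df.sort_values(['cell_id', 'diag']).copy()
    # Per-cell first difference; the first diagnostic of each cell is NaN
    out['delta_ah'] = out.groupby('cell_id')['capacity_ah'].diff()
    cols = ['cell_id', 'diag', 'capacity_ah', 'delta_ah']
    return out.loc[out['delta_ah'] > tol, cols]

# Toy data (made up): capacity recovers slightly at diag 2
toy = pd.DataFrame({
    'cell_id':     ['A', 'A', 'A'],
    'diag':        [0, 1, 2],
    'capacity_ah': [5.0, 4.8, 4.9],
})
print(capacity_increases(toy))
```

A small positive `tol` can be used to ignore rises below the expected measurement resolution; a suitable value depends on the test equipment and is not specified here.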