# Battery RUL Data Generation - 2 Year Full Dataset

This notebook generates 2 years of synthetic battery telemetry data for 216 batteries across 9 Thai data centers.

**Expected Output**: 227+ million telemetry records
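
The record count follows directly from the configuration described in this notebook. A quick check, assuming exactly one telemetry record per battery per sampling interval and no gaps (the generator may emit more or fewer rows in practice):

```python
# Configuration values taken from the notebook text.
locations = 9
batteries_per_location = 24
days = 730
sampling_seconds = 60

batteries = locations * batteries_per_location        # 216
samples_per_day = 24 * 60 * 60 // sampling_seconds    # 1440
expected_records = batteries * days * samples_per_day

print(f"{batteries} batteries x {days} days x {samples_per_day} samples/day "
      f"= {expected_records:,} records")
# prints: 216 batteries x 730 days x 1440 samples/day = 227,059,200 records
```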

**Runtime**: ~4-6 hours on Kaggle

**Resources**: Data generation is CPU-bound and does not use the GPU. Enabling the GPU accelerator is suggested only because Kaggle GPU instances may come with better CPUs.

## Step 1: Install Dependencies

In [None]:
!pip install numpy pandas scipy pytz faker tqdm matplotlib seaborn -q

## Step 2: Clone Repository

In [None]:
!git clone https://github.com/khiwniti/battery-rul-data-generation.git
%cd battery-rul-data-generation

## Step 3: Verify Files

In [None]:
!ls -lh

## Step 4: Generate Full 2-Year Dataset

This will generate:
- 216 batteries (24 per location)
- 9 Thai data centers with regional climate variations
- 730 days (2 years) of telemetry
- 60-second sampling interval
- Physics-based degradation models
- Power outages and HVAC failures

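**WARNING**: This cell will run for 4-6 hours. Do not close the browser tab while it runs interactively. Alternatively, use "Save & Run All" (see Method 2 under Download Instructions), which runs the notebook in the background.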

In [None]:
!python generate_full_dataset.py \
    --days 730 \
    --batteries-per-location 24 \
    --sampling-seconds 60 \
    --output-dir ./output/production_2years
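
Before committing to the multi-hour run, a short trial run can confirm that the script and its dependencies work. The sketch below only builds the command line, reusing the four flags from the cell above. The small values and the `./output/smoke_test` directory are illustrative assumptions, and the script has not been verified to accept them.

```python
import shlex

def build_generation_command(days, batteries_per_location, sampling_seconds, output_dir):
    """Return the argv list for generate_full_dataset.py using the flags shown above."""
    return [
        "python", "generate_full_dataset.py",
        "--days", str(days),
        "--batteries-per-location", str(batteries_per_location),
        "--sampling-seconds", str(sampling_seconds),
        "--output-dir", output_dir,
    ]

# Hypothetical small configuration for a quick trial run.
smoke_cmd = build_generation_command(2, 2, 60, "./output/smoke_test")
print(shlex.join(smoke_cmd))

# To execute it from the notebook:
# import subprocess; subprocess.run(smoke_cmd, check=True)
```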

## Step 5: Check Generated Files

In [None]:
!ls -lh output/production_2years/
!du -sh output/production_2years/*

## Step 6: Load and Verify Data

In [None]:
import pandas as pd
import json

# Load telemetry data
print("Loading raw telemetry...")
telemetry = pd.read_csv('output/production_2years/telemetry_jar_raw.csv.gz', nrows=10000)
print(f"Telemetry shape (sample): {telemetry.shape}")
print(f"\nColumns: {telemetry.columns.tolist()}")
print(f"\nFirst few rows:")
print(telemetry.head())

# Load battery states
print("\n" + "="*80)
print("Loading battery states...")
with open('output/production_2years/battery_states.json') as f:
    battery_states = json.load(f)
print(f"Total batteries: {len(battery_states)}")

# Summary statistics
states_df = pd.DataFrame(battery_states).T
print(f"\nSOH Statistics:")
print(states_df['soh_pct'].describe())
print(f"\nDegradation Profile Distribution:")
print(states_df['degradation_profile'].value_counts())
print(f"\nFailed Batteries: {states_df['has_failed'].sum()}")
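
The cell above reads only the first 10,000 rows, so it does not confirm the total record count. One way to check the full file without loading it into memory is to stream it. This is a minimal sketch using only the standard library; `count_csv_rows` is a hypothetical helper, and it assumes the file has a single header row.

```python
import csv
import gzip

def count_csv_rows(path):
    """Count data rows (excluding the header) in a gzip-compressed CSV, streaming row by row."""
    with gzip.open(path, "rt", newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row
        return sum(1 for _ in reader)

# Example (can take a long time on the full file):
# n_rows = count_csv_rows('output/production_2years/telemetry_jar_raw.csv.gz')
# print(f"Total telemetry rows: {n_rows:,}")
```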

## Step 7: Create Dataset Archive for Download

In [None]:
!tar -czf battery_rul_2year_dataset.tar.gz output/production_2years/
!ls -lh battery_rul_2year_dataset.tar.gz
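
Because the archive is large, it can help to record a checksum before downloading and compare it afterwards. The sketch below uses only the standard library; `sha256_of_file` and `list_archive_members` are hypothetical helper names, not part of the repository.

```python
import hashlib
import tarfile

def sha256_of_file(path, block_size=1024 * 1024):
    """Compute the SHA-256 hex digest of a file, reading it in blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()

def list_archive_members(path):
    """Return the names of regular files stored in a .tar.gz archive."""
    with tarfile.open(path, "r:gz") as tar:
        return [m.name for m in tar.getmembers() if m.isfile()]

# Example:
# print(sha256_of_file("battery_rul_2year_dataset.tar.gz"))
# print(list_archive_members("battery_rul_2year_dataset.tar.gz"))
```

Running `sha256_of_file` again on the downloaded copy and comparing the two digests shows whether the transfer was complete.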

## Step 8: Visualization (Optional)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

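# Load the first 100,000 rows for visualization. This is not a random sample,
# so it may cover only a few batteries or a short time window.
# Requires `pd` and `states_df` from Step 6.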
sample_telemetry = pd.read_csv('output/production_2years/telemetry_jar_raw.csv.gz', nrows=100000)
sample_telemetry['timestamp'] = pd.to_datetime(sample_telemetry['timestamp'])

# Plot voltage distribution
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
sns.histplot(sample_telemetry['voltage_v'], bins=50)
plt.title('Voltage Distribution')
plt.xlabel('Voltage (V)')

# Plot temperature distribution
plt.subplot(1, 2, 2)
sns.histplot(sample_telemetry['temperature_c'], bins=50)
plt.title('Temperature Distribution')
plt.xlabel('Temperature (°C)')

plt.tight_layout()
plt.show()

# SOH distribution
plt.figure(figsize=(8, 5))
sns.histplot(states_df['soh_pct'], bins=30, kde=True)
plt.title('Battery State of Health Distribution')
plt.xlabel('SOH (%)')
plt.ylabel('Count')
plt.show()

print("\nSummary Statistics (telemetry values from the 100,000-row sample):")
print(f"Voltage range: {sample_telemetry['voltage_v'].min():.2f}V - {sample_telemetry['voltage_v'].max():.2f}V")
print(f"Temperature range: {sample_telemetry['temperature_c'].min():.1f}°C - {sample_telemetry['temperature_c'].max():.1f}°C")
print(f"Mean SOH: {states_df['soh_pct'].mean():.1f}%")

## Download Instructions

### Method 1: Direct Download (Recommended)
1. Click the folder icon on the left sidebar
2. Navigate to `battery_rul_2year_dataset.tar.gz`
3. Click the three dots (⋮) next to the file
4. Select "Download"

### Method 2: Kaggle Dataset
1. Click "File" → "Save Version"
2. Select "Save & Run All"
3. After the run completes, the generated files are available as the notebook's output
4. Download the output with the Kaggle API:
```bash
kaggle kernels output YOUR_USERNAME/battery-rul-generation -p ./downloads/
```

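### Method 3: Google Drive (For Large Files)
This method applies when the notebook runs in Google Colab. `drive.mount` depends on the Colab runtime and is not expected to work on Kaggle. In Colab, run the cell below to copy the archive to your Google Drive: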

In [None]:
# Uncomment to upload to Google Drive
# from google.colab import drive
# drive.mount('/content/drive')
# !cp battery_rul_2year_dataset.tar.gz /content/drive/MyDrive/

## Next Steps

After downloading the dataset:

1. **Extract the archive**:
   ```bash
   tar -xzf battery_rul_2year_dataset.tar.gz
   ```

2. **Load into database** (if using backend):
   - Use data loading scripts in the backend repository
   - Ensure schema alignment

3. **Train ML models**:
   - Use `Battery_RUL_Training.ipynb` notebook
   - Features are in `feature_store.csv.gz`
   - Ground truth is in `battery_states.json`

4. **Deploy predictions**:
   - Train your model
   - Export to ONNX/TorchScript
   - Integrate with backend API
   - Deploy to Railway
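
For step 3, the features and the ground truth have to be joined per battery before training. The sketch below shows one way to do that with pandas. It assumes the feature table has a `battery_id` column and that `battery_states.json` is keyed by battery ID. Neither is confirmed by this notebook, so adjust the names to the actual schema. `attach_labels` is a hypothetical helper.

```python
import pandas as pd

def attach_labels(features, battery_states, id_column="battery_id", label="soh_pct"):
    """Add a per-battery label column to a feature table.

    `features` is a DataFrame with an `id_column` (assumed name);
    `battery_states` is a dict keyed by battery ID (assumed) whose values
    are dicts containing `label`.
    """
    labels = pd.Series(
        {bid: state[label] for bid, state in battery_states.items()},
        name=label,
    )
    # Look up each row's battery ID in the label index.
    return features.join(labels, on=id_column)

# Example, after extracting the archive:
# import json
# features = pd.read_csv('output/production_2years/feature_store.csv.gz')
# with open('output/production_2years/battery_states.json') as f:
#     battery_states = json.load(f)
# training_table = attach_labels(features, battery_states)
```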

**GitHub Repository**: https://github.com/khiwniti/battery-rul-data-generation