In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

path = "../data_raw/charging_sessions.csv"
df = pd.read_csv(path)



Get an overview of the data with `info()` and `head()`.

In [3]:
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 66450 entries, 0 to 66449
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Unnamed: 0        66450 non-null  int64  
 1   id                66450 non-null  object 
 2   connectionTime    66450 non-null  object 
 3   disconnectTime    66450 non-null  object 
 4   doneChargingTime  62362 non-null  object 
 5   kWhDelivered      66450 non-null  float64
 6   sessionID         66450 non-null  object 
 7   siteID            66450 non-null  int64  
 8   spaceID           66450 non-null  object 
 9   stationID         66450 non-null  object 
 10  timezone          66450 non-null  object 
 11  userID            49187 non-null  float64
 12  userInputs        49187 non-null  object 
dtypes: float64(2), int64(2), object(9)
memory usage: 6.6+ MB


,Unnamed: 0,id,connectionTime,disconnectTime,doneChargingTime,kWhDelivered,sessionID,siteID,spaceID,stationID,timezone,userID,userInputs
0,0,5e23b149f9af8b5fe4b973cf,2020-01-02 13:08:54+00:00,2020-01-02 19:11:15+00:00,2020-01-02 17:31:35+00:00,25.016,1_1_179_810_2020-01-02 13:08:53.870034,1,AG-3F30,1-1-179-810,America/Los_Angeles,194.0,"[{'WhPerMile': 250, 'kWhRequested': 25.0, 'mil..."
1,1,5e23b149f9af8b5fe4b973d0,2020-01-02 13:36:50+00:00,2020-01-02 22:38:21+00:00,2020-01-02 20:18:05+00:00,33.097,1_1_193_825_2020-01-02 13:36:49.599853,1,AG-1F01,1-1-193-825,America/Los_Angeles,4275.0,"[{'WhPerMile': 280, 'kWhRequested': 70.0, 'mil..."
2,2,5e23b149f9af8b5fe4b973d1,2020-01-02 13:56:35+00:00,2020-01-03 00:39:22+00:00,2020-01-02 16:35:06+00:00,6.521,1_1_193_829_2020-01-02 13:56:35.214993,1,AG-1F03,1-1-193-829,America/Los_Angeles,344.0,"[{'WhPerMile': 400, 'kWhRequested': 8.0, 'mile..."
3,3,5e23b149f9af8b5fe4b973d2,2020-01-02 13:59:58+00:00,2020-01-02 16:38:39+00:00,2020-01-02 15:18:45+00:00,2.355,1_1_193_820_2020-01-02 13:59:58.309319,1,AG-1F04,1-1-193-820,America/Los_Angeles,1117.0,"[{'WhPerMile': 400, 'kWhRequested': 8.0, 'mile..."
4,4,5e23b149f9af8b5fe4b973d3,2020-01-02 14:00:01+00:00,2020-01-02 22:08:40+00:00,2020-01-02 18:17:30+00:00,13.375,1_1_193_819_2020-01-02 14:00:00.779967,1,AG-1F06,1-1-193-819,America/Los_Angeles,334.0,"[{'WhPerMile': 400, 'kWhRequested': 16.0, 'mil..."


## Clean Data

### Drop the index column "Unnamed: 0"

The column has 66,450 non-null integer values, one per row, which simply duplicate the row index (0 to 66449). It therefore carries no additional information and can be removed.

In [4]:
# 1) Drop explicit index column
df = df.drop(columns=['Unnamed: 0'])
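
Before dropping such a column, it can be worth confirming that it really only repeats the row index. The following is a minimal sketch on a hypothetical three-row stand-in for the CSV (the `csv_text` contents are made up for illustration), not a step the notebook itself runs:

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for charging_sessions.csv
csv_text = "Unnamed: 0,id,kWhDelivered\n0,a,25.016\n1,b,33.097\n2,c,6.521\n"
sample = pd.read_csv(io.StringIO(csv_text))

# True when the column merely repeats the default RangeIndex
is_redundant = bool((sample["Unnamed: 0"] == sample.index).all())

# Drop the column only if it adds nothing beyond the index
if is_redundant:
    sample = sample.drop(columns=["Unnamed: 0"])

print(is_redundant, list(sample.columns))
```

The same comparison could be run on `df` before the drop above.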

## Format Values

### Parse all datetimes in the local time zone America/Los_Angeles for consistency

In [5]:
unique_timezones = df["timezone"].dropna().unique()
print("Unique timezones:", unique_timezones)

Unique timezones: ['America/Los_Angeles']


Finding: All records share a single timezone (America/Los_Angeles).

The three timestamp columns are therefore parsed as UTC and converted to this local timezone for consistency.

In [6]:
time_cols = ["connectionTime", "disconnectTime", "doneChargingTime"]
local_timezone = unique_timezones[0]
for col in time_cols:
    df[col] = pd.to_datetime(df[col], utc=True, errors="coerce")
    df[col] = df[col].dt.tz_convert(local_timezone)
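
To illustrate what the conversion does, here is a small sketch using the first `connectionTime` value from the `head()` output plus one missing value. The series name `raw` is only for illustration. `tz_convert` changes the displayed wall-clock time and offset, not the instant itself, and missing values stay `NaT`.

```python
import pandas as pd

# One timestamp taken from the head() output above, plus a missing value
raw = pd.Series(["2020-01-02 13:08:54+00:00", None])

# Parse as timezone-aware UTC; unparseable or missing entries become NaT
parsed = pd.to_datetime(raw, utc=True, errors="coerce")

# Convert to local time; the underlying instant is unchanged
local = parsed.dt.tz_convert("America/Los_Angeles")

print(local.iloc[0])  # 2020-01-02 05:08:54-08:00
```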

### Parse all values of kWhDelivered as floats

In [7]:
# Probably not necessary, as the dtype is already float64; kept as a safeguard
print(df['kWhDelivered'].dtype)

df['kWhDelivered'] = pd.to_numeric(df['kWhDelivered'], errors='coerce')

float64
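
Since the column is already `float64`, the call above leaves it unchanged. The sketch below shows, on made-up string values, what `errors="coerce"` would do if the column had been read as text: numeric strings are parsed and anything else becomes `NaN`.

```python
import pandas as pd

# Hypothetical raw values, as they might look if the column were read as text
raw = pd.Series(["25.016", "33.097", "n/a"])

# Numeric strings are parsed; non-numeric entries are coerced to NaN
values = pd.to_numeric(raw, errors="coerce")

print(values.dtype)           # float64
print(values.isna().tolist())  # [False, False, True]
```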


### Use string dtype and normalize strings for categorical columns

In [8]:
cat_cols = ["siteID", "spaceID", "stationID", "timezone", "id", "sessionID"]
for col in cat_cols:
    # print(f'Unique values in {col}: ', df[col].nunique(), f' vs. Unique values in {col} after cleaning: ', df[col].astype(str).str.strip().str.lower().nunique())
    df[col] = df[col].astype(str).str.strip().str.lower()
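
The commented-out `print` in the cell above compares the number of unique values before and after normalization. A minimal sketch of that check, using made-up `spaceID` values and a hypothetical helper `normalize_ids`, could look like this:

```python
import pandas as pd


def normalize_ids(series: pd.Series) -> pd.Series:
    """Cast to string, trim surrounding whitespace, and lowercase."""
    return series.astype(str).str.strip().str.lower()


# Made-up values: the first two differ only in case and whitespace
space_ids = pd.Series(["AG-3F30", " ag-3f30", "AG-1F01"])
normalized = normalize_ids(space_ids)

before, after = space_ids.nunique(), normalized.nunique()
print(before, after)  # 3 2
```

A drop in the unique count means that some identifiers differed only in case or surrounding whitespace. Note that `astype(str)` turns missing values into the literal string `"nan"`, so this pattern is safest on columns without nulls, which is the case for the six columns listed in `cat_cols`.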

## Handle Missing Values

In [9]:
print(df.isnull().sum())

id                      0
connectionTime          0
disconnectTime          0
doneChargingTime     4088
kWhDelivered            0
sessionID               0
siteID                  0
spaceID                 0
stationID               0
timezone                0
userID              17263
userInputs          17263
dtype: int64


Finding: The columns doneChargingTime, userID, and userInputs have missing values.
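
Expressed as shares of the 66,450 rows, the counts above work out as follows. This is plain arithmetic on the printed numbers; on the DataFrame itself, `df.isna().mean()` would give the same fractions.

```python
# Counts taken from the isnull().sum() output above
total_rows = 66450
missing = {"doneChargingTime": 4088, "userID": 17263, "userInputs": 17263}

# Fraction of rows with a missing value, per column
shares = {col: count / total_rows for col, count in missing.items()}

for col, share in shares.items():
    print(f"{col}: {share:.2%}")
# doneChargingTime: 6.15%
# userID: 25.98%
# userInputs: 25.98%
```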

### Handle missing doneChargingTime values

In [10]:
mask_na = df["doneChargingTime"].isna() & df["kWhDelivered"].gt(0)
df_na = df.loc[mask_na].copy()
print('COUNT missing doneChargingTime with kWhDelivered > 0: ', mask_na.sum())

COUNT missing doneChargingTime with kWhDelivered > 0:  4088


There are 4,088 records where doneChargingTime is missing. Every one of these records has positive kWhDelivered. This indicates that the charging process occurred normally, but the system failed to log doneChargingTime. These records will be analyzed further to verify that imputing doneChargingTime = disconnectTime is reasonable.

In [11]:
df_na["duration_h"] = (
    df_na["disconnectTime"] - df_na["connectionTime"]
).dt.total_seconds() / 3600
df_na[["duration_h", "kWhDelivered"]].describe(include="all")

,duration_h,kWhDelivered
count,4088.0,4088.0
mean,4.744335,13.983593
std,4.087304,11.767161
min,0.034444,0.502
25%,1.028472,5.77675
50%,3.880278,10.8945
75%,8.400486,18.1435
max,57.457222,77.7


The missing doneChargingTime sessions show plausible charging behavior:

* Average session duration ≈ 4.7 hours, median ≈ 3.9 hours

* Average energy delivered ≈ 14 kWh, with max ≈ 78 kWh

* No negative or zero durations appear, indicating correct chronological order

Setting doneChargingTime = disconnectTime uses the latest possible end of charging and may overestimate the true charging duration for some sessions. Given the available data, however, it is the most defensible imputation.

Dropping these sessions would remove 6.15% of all sessions and introduce systematic bias, while estimating an earlier timestamp would require unavailable information (EV model, SOC, charging curve).

Therefore, imputing doneChargingTime = disconnectTime is the soundest approach.

In [12]:
# If doneChargingTime is missing but energy was delivered -> set to disconnectTime
df.loc[mask_na, "doneChargingTime"] = df.loc[mask_na, "disconnectTime"]
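
The effect of the masked assignment can be seen on a small made-up frame (`toy` and its values are hypothetical). Only rows that are missing `doneChargingTime` and have positive energy are filled; a row with zero energy would stay missing.

```python
import pandas as pd

# Made-up sessions: complete, missing with energy, missing without energy
toy = pd.DataFrame({
    "disconnectTime": pd.to_datetime(
        ["2020-01-02 19:00", "2020-01-02 20:00", "2020-01-02 21:00"], utc=True
    ),
    "doneChargingTime": pd.to_datetime(
        ["2020-01-02 17:30", None, None], utc=True
    ),
    "kWhDelivered": [25.0, 10.0, 0.0],
})

# Same condition as mask_na above
mask = toy["doneChargingTime"].isna() & toy["kWhDelivered"].gt(0)

# Fill only the masked rows with their disconnectTime
toy.loc[mask, "doneChargingTime"] = toy.loc[mask, "disconnectTime"]

print(toy["doneChargingTime"].isna().tolist())  # [False, False, True]
```

In the actual dataset all 4,088 missing rows have positive energy, so every one of them is filled.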

## Check for invalid values

### Check for invalid doneChargingTime entries

To ensure temporal consistency, `doneChargingTime` must lie between `connectionTime` and `disconnectTime` (inclusive).

In [13]:
mask_1 = df["doneChargingTime"] < df["connectionTime"]
mask_2 = df["doneChargingTime"] > df["disconnectTime"]
print('COUNT doneChargingTime < connectionTime: ', mask_1.sum())
print('COUNT doneChargingTime > disconnectTime: ', mask_2.sum())

df_invalid = df[mask_1 | mask_2].copy()

COUNT doneChargingTime < connectionTime:  27
COUNT doneChargingTime > disconnectTime:  4692


We identified two types of temporal inconsistencies in the dataset:

* `doneChargingTime < connectionTime`: 27 cases  
* `doneChargingTime > disconnectTime`: 4,692 cases  

We then quantified how far the invalid `doneChargingTime` values deviate from their valid bounds:

In [14]:
# How many invalid doneChargingTime values are "close" (e.g. within 300 seconds)?

threshold = 300  # seconds

early_off = (df.loc[mask_1, "connectionTime"] - df.loc[mask_1, "doneChargingTime"]).dt.total_seconds().abs()
late_off  = (df.loc[mask_2, "doneChargingTime"] - df.loc[mask_2, "disconnectTime"]).dt.total_seconds().abs()

print(f"EARLY  cases: {len(early_off)} sessions")
print(f"  -> { (early_off <= threshold).mean():.2%} within {threshold} seconds of connectionTime\n")

print(f"LATE   cases: {len(late_off)} sessions")
print(f"  -> { (late_off <= threshold).mean():.2%} within {threshold} seconds of disconnectTime")


EARLY  cases: 27 sessions
  -> 88.89% within 300 seconds of connectionTime

LATE   cases: 4692 sessions
  -> 99.96% within 300 seconds of disconnectTime


These results show that nearly all inconsistencies are small: 88.89% of the early cases and 99.96% of the late cases lie within 300 seconds of the valid bound, which points to minor logging delays rather than invalid sessions. Therefore, the most transparent correction is to **clip `doneChargingTime` into the valid interval** [`connectionTime`, `disconnectTime`]. This restores temporal consistency while preserving all charging sessions for subsequent analysis.

In [15]:
# Clip doneChargingTime to [connectionTime, disconnectTime]
df.loc[mask_1, "doneChargingTime"] = df.loc[mask_1, "connectionTime"]
df.loc[mask_2, "doneChargingTime"] = df.loc[mask_2, "disconnectTime"]
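
As a sketch of the clipping logic, the following applies the same two steps to three made-up sessions (one early, one valid, one late) using `Series.mask`. The helper `ts` and all timestamps are hypothetical.

```python
import pandas as pd


def ts(values):
    """Build a timezone-aware (UTC) datetime Series from strings."""
    return pd.Series(pd.to_datetime(values, utc=True))


conn = ts(["2020-01-02 13:00:00"] * 3)
disc = ts(["2020-01-02 19:00:00"] * 3)
# Row 0 ends 30 s before connection, row 1 is valid, row 2 ends 45 s after disconnect
done = ts(["2020-01-02 12:59:30", "2020-01-02 17:00:00", "2020-01-02 19:00:45"])

# Raise values below the lower bound, then lower values above the upper bound
clipped = done.mask(done < conn, conn)
clipped = clipped.mask(clipped > disc, disc)

print(((clipped >= conn) & (clipped <= disc)).all())  # True
```

Values already inside the interval are left untouched, which matches the two `df.loc` assignments above.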

### Check for sessions where disconnectTime is before connectionTime

In [16]:
# Sessions where disconnectTime < connectionTime
print('COUNT disconnectTime < connectionTime: ', (df["disconnectTime"] < df["connectionTime"]).sum())

COUNT disconnectTime < connectionTime:  0


There are no sessions where disconnectTime < connectionTime.

### Check for sessions where kWhDelivered is negative

In [17]:
# Sessions where kWhDelivered is negative
print('COUNT negative energy rows: ', (df["kWhDelivered"] < 0).sum())


COUNT negative energy rows:  0


There are no sessions where kWhDelivered is negative.

### Check for sessions where duration is zero or negative

In [18]:
# Check for sessions with zero or negative duration
duration_h = (df["disconnectTime"] - df["connectionTime"]).dt.total_seconds() / 3600.0
print('COUNT of sessions <= 0 h: ', (duration_h <= 0).sum())


COUNT of sessions <= 0 h:  0


There are no sessions where the duration is zero or negative.
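
The three validity checks above could also be bundled into one helper so they are easy to rerun after later cleaning steps. This is only a sketch: `count_violations` is a hypothetical name and the two-row frame is made up, with the second row violating every rule on purpose.

```python
import pandas as pd


def count_violations(frame: pd.DataFrame) -> dict:
    """Count rows that violate the basic time and energy rules."""
    duration = frame["disconnectTime"] - frame["connectionTime"]
    return {
        "disconnect_before_connect": int(
            (frame["disconnectTime"] < frame["connectionTime"]).sum()
        ),
        "negative_energy": int((frame["kWhDelivered"] < 0).sum()),
        "non_positive_duration": int((duration <= pd.Timedelta(0)).sum()),
    }


# Made-up sessions: the first is valid, the second breaks all three rules
toy = pd.DataFrame({
    "connectionTime": pd.to_datetime(
        ["2020-01-02 13:00", "2020-01-02 15:00"], utc=True
    ),
    "disconnectTime": pd.to_datetime(
        ["2020-01-02 19:00", "2020-01-02 14:00"], utc=True
    ),
    "kWhDelivered": [25.0, -1.0],
})

violations = count_violations(toy)
print(violations)
```

On the cleaned `df`, all three counts would be expected to be zero, in line with the outputs above.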

## Handle duplicated rows

### Handle duplicates on sessionID

In [19]:
dup_mask = df["sessionID"].duplicated(keep=False)
num_duplicated_sessions = dup_mask.sum()
print("Rows with duplicated sessionID:", num_duplicated_sessions)

Rows with duplicated sessionID: 2826


Each `sessionID` should represent exactly one charging session, but **2,826 rows** share their `sessionID` with at least one other row.
These duplicates indicate repeated logging of the same session and must be consolidated.

We first remove duplicate `sessionID`s (keeping the record with the latest `connectionTime` per session) and then check `id` and the physical key (`spaceID`, `connectionTime`) for any remaining duplicates.

In [20]:
sort_cols = ["sessionID", "connectionTime"]
df = df.sort_values(sort_cols)
df = df.drop_duplicates(subset=["sessionID"], keep="last")
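
The sort-then-deduplicate pattern can be checked on a small made-up frame (`toy` and its values are hypothetical). After sorting by `sessionID` and `connectionTime`, `keep="last"` retains the row with the latest `connectionTime` for each `sessionID`.

```python
import pandas as pd

# Made-up sessions: "s1" was logged twice with different connection times
toy = pd.DataFrame({
    "sessionID": ["s1", "s1", "s2"],
    "connectionTime": pd.to_datetime(
        ["2020-01-02 13:00", "2020-01-02 13:05", "2020-01-02 14:00"], utc=True
    ),
    "kWhDelivered": [5.0, 5.5, 8.0],
})

# Sort so the latest record per session comes last, then keep that one
deduped = (
    toy.sort_values(["sessionID", "connectionTime"])
       .drop_duplicates(subset=["sessionID"], keep="last")
)

print(deduped["kWhDelivered"].tolist())  # [5.5, 8.0]
```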

In [21]:
# Check duplicates on id
print(df["id"].duplicated().sum())

# Check duplicates on physical key (spaceID, connectionTime)
print(df.duplicated(subset=["spaceID", "connectionTime"], keep=False).sum())

df[df.duplicated(subset=["spaceID", "connectionTime"], keep=False)]


0
2


,id,connectionTime,disconnectTime,doneChargingTime,kWhDelivered,sessionID,siteID,spaceID,stationID,timezone,userID,userInputs
18615,610b311df9af8b0360885063,2021-07-19 13:11:16-07:00,2021-07-19 14:58:08-07:00,2021-07-19 13:45:35-07:00,5.079,1_1_179_794_2021-07-19 19:10:21.987739,1,ag-3f20,1-1-179-794,america/los_angeles,10894.0,"[{'WhPerMile': 304, 'kWhRequested': 42.56, 'mi..."
18616,610b311df9af8b0360885072,2021-07-19 13:11:16-07:00,2021-07-19 17:25:14-07:00,2021-07-19 13:45:35-07:00,12.412858,1_1_179_794_2021-07-19 23:20:30.289881,1,ag-3f20,1-1-179-794,america/los_angeles,,
