# Cluster Analysis

### Task Description
To better understand what typical charging sessions look like, carry out a cluster
analysis to provide management with a succinct report of archetypical charging events. Think of an
appropriate trade-off between explainability and information content and try to come up with names
for these clusters. What is the value of identifying different types of charging sessions?

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np


### Data Prep

In [2]:
df_charging_sessions = pd.read_csv("charging_sessions_preprocessed.csv", parse_dates=["connectionTime", "disconnectTime", "doneChargingTime"])
# convert to local time
df_charging_sessions["connectionTime"] = df_charging_sessions["connectionTime"].dt.tz_convert("America/Los_Angeles")
df_charging_sessions["disconnectTime"] = df_charging_sessions["disconnectTime"].dt.tz_convert("America/Los_Angeles")
df_charging_sessions["doneChargingTime"] = df_charging_sessions["doneChargingTime"].dt.tz_convert("America/Los_Angeles")
print(df_charging_sessions.info())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60921 entries, 0 to 60920
Data columns (total 14 columns):
 #   Column               Non-Null Count  Dtype                              
---  ------               --------------  -----                              
 0   id                   60921 non-null  object                             
 1   connectionTime       60921 non-null  datetime64[ns, America/Los_Angeles]
 2   disconnectTime       60921 non-null  datetime64[ns, America/Los_Angeles]
 3   doneChargingTime     60921 non-null  datetime64[ns, America/Los_Angeles]
 4   kWhDelivered         60921 non-null  float64                            
 5   sessionID            60921 non-null  object                             
 6   siteID               60921 non-null  int64                              
 7   spaceID              60921 non-null  object                             
 8   stationID            60921 non-null  object                             
 9   userID               44636 n

In [3]:
print(len(df_charging_sessions))
charging_sessions_clustering = df_charging_sessions.copy()
charging_sessions_clustering = charging_sessions_clustering.drop_duplicates()
print(len(charging_sessions_clustering))

60921
60921


In [4]:
print(len(charging_sessions_clustering['sessionID'].unique()))
print(len(charging_sessions_clustering['sessionID']))

60921
60921


Dropping duplicates removes no rows (the count stays at 60921), and every sessionID is unique.

For the cluster analysis, we include weather data, since it could carry useful information about the types of charging sessions.
The weather data is only available at hourly resolution, so we merge the two datasets on the day and hour at which the charging session started. We keep the temperature and precipitation columns, which tell us, for example, whether it was cold and rainy when a session began.
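
The merge on day and hour described above could look roughly like the following sketch. The toy frames `sessions` and `weather` and all values in them are made up for illustration; only the column names `connectionTime`, `connectionDate`, `connectionHour`, `temperature` and `precipitation` follow this notebook.

```python
import pandas as pd

# Toy stand-ins for the real frames (all values are invented for illustration).
sessions = pd.DataFrame({
    "sessionID": ["a", "b", "c"],
    "connectionTime": pd.to_datetime(
        ["2018-05-01 08:15", "2018-05-01 08:50", "2018-05-01 17:05"]
    ).tz_localize("America/Los_Angeles"),
})
weather = pd.DataFrame({
    "connectionDate": [pd.Timestamp("2018-05-01").date()] * 2,
    "connectionHour": [8, 17],
    "temperature": [15.0, 22.0],
    "precipitation": [0.0, 0.3],
})

# Derive the join keys from the local session start time.
sessions["connectionDate"] = sessions["connectionTime"].dt.date
sessions["connectionHour"] = sessions["connectionTime"].dt.hour

# A left join keeps every session, even if no weather row matches its hour.
merged = sessions.merge(weather, on=["connectionDate", "connectionHour"], how="left")
print(merged[["sessionID", "temperature", "precipitation"]])
```

Sessions that start within the same hour receive the same weather values, which is the expected consequence of the hourly resolution.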

In [6]:
# read in weather data
weather_set = pd.read_csv('weather_burbank_airport_preprocessed.csv', parse_dates=['timestamp'])

# convert 'timestamp' to datetime and assign the time zone
weather_set['timestamp'] = pd.to_datetime(weather_set['timestamp']).dt.tz_localize('UTC').dt.tz_convert('America/Los_Angeles')

# show the results
print(weather_set.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29244 entries, 0 to 29243
Data columns (total 8 columns):
 #   Column                   Non-Null Count  Dtype                              
---  ------                   --------------  -----                              
 0   timestamp                29244 non-null  datetime64[ns, America/Los_Angeles]
 1   temperature              29244 non-null  float64                            
 2   cloud_cover              29244 non-null  float64                            
 3   cloud_cover_description  29244 non-null  object                             
 4   pressure                 29244 non-null  float64                            
 5   windspeed                29244 non-null  float64                            
 6   precipitation            29244 non-null  float64                            
 7   felt_temperature         29244 non-null  float64                            
dtypes: datetime64[ns, America/Los_Angeles](1), float64(6), object(1)
me

In [7]:
# drop columns that are not needed
weather_set = weather_set.drop(['cloud_cover', 'cloud_cover_description','pressure', 'windspeed', 'felt_temperature'], axis=1)
print(weather_set.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29244 entries, 0 to 29243
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype                              
---  ------         --------------  -----                              
 0   timestamp      29244 non-null  datetime64[ns, America/Los_Angeles]
 1   temperature    29244 non-null  float64                            
 2   precipitation  29244 non-null  float64                            
dtypes: datetime64[ns, America/Los_Angeles](1), float64(2)
memory usage: 685.5 KB
None


In [8]:
weather_set['connectionDate'] = weather_set['timestamp'].dt.date
weather_set['connectionHour'] = weather_set['timestamp'].dt.hour
days_unique = weather_set['connectionDate'].unique()
days = np.repeat(days_unique,24)
hours = list(range(0,24))
hours = hours*len(days_unique)

weather_hourly = pd.DataFrame()
weather_hourly['connectionDate'] = days
weather_hourly['connectionHour'] = hours
print(len(weather_hourly))
print(weather_hourly)

26280
      connectionDate  connectionHour
0         2018-01-01               0
1         2018-01-01               1
2         2018-01-01               2
3         2018-01-01               3
4         2018-01-01               4
...              ...             ...
26275     2020-12-31              19
26276     2020-12-31              20
26277     2020-12-31              21
26278     2020-12-31              22
26279     2020-12-31              23

[26280 rows x 2 columns]


In [9]:
# add columns temperature and precipitation
# if there are several values for an hour of a day, the mean is calculated
temperature = []
precipitation = []
for index, row in weather_hourly.iterrows():
    searched_day = row['connectionDate']
    searched_hour = row['connectionHour']
    temperatures_of_hour = weather_set[(weather_set['connectionDate']==searched_day) & (weather_set['connectionHour']==searched_hour)]['temperature']
    temperature.append(temperatures_of_hour.mean())
    precipitation_of_hour = weather_set[(weather_set['connectionDate']==searched_day) & (weather_set['connectionHour']==searched_hour)]['precipitation']
    precipitation.append(precipitation_of_hour.mean())

weather_hourly['temperature'] = temperature
weather_hourly['precipitation'] = precipitation
print(weather_hourly.info())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26280 entries, 0 to 26279
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   connectionDate  26280 non-null  object 
 1   connectionHour  26280 non-null  int64  
 2   temperature     26210 non-null  float64
 3   precipitation   26210 non-null  float64
dtypes: float64(2), int64(1), object(1)
memory usage: 821.4+ KB
None
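
The per-hour averaging in the loop above can also be expressed with a `groupby` followed by a left join onto the full day/hour grid, which avoids iterating over rows. This is only a sketch on made-up toy data; `weather_raw`, `hourly_mean`, `grid` and `hourly` are hypothetical names, not objects from this notebook.

```python
import pandas as pd

# Toy readings: hour 8 has two readings, hour 9 has none (invented values).
weather_raw = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2018-01-01 08:05", "2018-01-01 08:55", "2018-01-01 10:00"]
    ),
    "temperature": [10.0, 12.0, 14.0],
    "precipitation": [0.0, 0.2, 0.0],
})
weather_raw["connectionDate"] = weather_raw["timestamp"].dt.date
weather_raw["connectionHour"] = weather_raw["timestamp"].dt.hour

# Mean of all readings that fall into the same (day, hour) bucket.
hourly_mean = (
    weather_raw
    .groupby(["connectionDate", "connectionHour"], as_index=False)[
        ["temperature", "precipitation"]
    ]
    .mean()
)

# Full grid with 24 hours for every day that appears in the data.
days = weather_raw["connectionDate"].unique()
grid = pd.MultiIndex.from_product(
    [days, range(24)], names=["connectionDate", "connectionHour"]
).to_frame(index=False)

# Hours without any reading end up as NaN after the left join.
hourly = grid.merge(hourly_mean, on=["connectionDate", "connectionHour"], how="left")
print(hourly.iloc[7:11])
```

As in the loop version, hours with no reading are left as missing values and need to be filled in a later step.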


We have some null values for temperature and precipitation because the weather dataset does not contain a reading for every hour of every day. For these hours, we replace the null value with the mean of the available values for that day.

In [10]:
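def fill_in_mean_of_day(df, column):
    """Replace missing values in `column` with the mean of that column for the same day.

    Note: `df` is modified in place and also returned.
    """
    for index, row in df.iterrows():
        value = row[column]
        if pd.isna(value):
            day_of_null = row['connectionDate']
            # mean() skips NaN, so only the hours with data contribute
            mean_of_day = df[df['connectionDate'] == day_of_null][column].mean()
            df.loc[index, column] = mean_of_day
    return df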

In [11]:
# fill_in_mean_of_day modifies the frame in place, so both calls work on the same object
weather_hourly_cleaned = fill_in_mean_of_day(weather_hourly, 'temperature')
weather_hourly_cleaned = fill_in_mean_of_day(weather_hourly_cleaned, 'precipitation')
print(weather_hourly_cleaned.info())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26280 entries, 0 to 26279
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   connectionDate  26280 non-null  object 
 1   connectionHour  26280 non-null  int64  
 2   temperature     26280 non-null  float64
 3   precipitation   26280 non-null  float64
dtypes: float64(2), int64(1), object(1)
memory usage: 821.4+ KB
None
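
The same day-mean imputation can be sketched without an explicit row loop by using `groupby(...).transform("mean")` together with `fillna`. The frame `hourly` and its values are invented for illustration; only the column names follow this notebook.

```python
import numpy as np
import pandas as pd

# Toy hourly frame with one missing temperature per day (invented values).
hourly = pd.DataFrame({
    "connectionDate": ["2018-01-01"] * 3 + ["2018-01-02"] * 3,
    "connectionHour": [0, 1, 2] * 2,
    "temperature": [10.0, np.nan, 14.0, 20.0, 22.0, np.nan],
})

# Per-row mean of the same day; NaN values are skipped when averaging.
day_mean = hourly.groupby("connectionDate")["temperature"].transform("mean")

# Only the missing entries are replaced, existing readings stay untouched.
hourly["temperature"] = hourly["temperature"].fillna(day_mean)
print(hourly)
```

If a whole day had no readings at all, its day mean would itself be NaN and the gap would remain, so such days would need separate handling.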


In [12]:
weather_hourly_cleaned.to_csv('weather_for_Prediction.csv', index=False)