# Task 1: Data Collection and Preparation

In [1]:
import pandas as pd

## Charging Sessions Dataset

Read in the data as a dataframe. Interpret the date columns as dates.

In [2]:
Charging_set = pd.read_csv('charging_sessions.csv', parse_dates=['connectionTime','disconnectTime','doneChargingTime'])
print('Columns: ', Charging_set.columns)
print('Number of rows: ', Charging_set.shape[0])

Columns:  Index(['Unnamed: 0', 'id', 'connectionTime', 'disconnectTime',
       'doneChargingTime', 'kWhDelivered', 'sessionID', 'siteID', 'spaceID',
       'stationID', 'timezone', 'userID', 'userInputs'],
      dtype='object')
Number of rows:  66450


Define functions to analyze the columns.

In [3]:
# compute the number of missing values
def missingValues(columnName, dataSet) :
    if (dataSet[columnName].isnull().any()):
        percentage = round(dataSet[columnName].isnull().sum()/dataSet.shape[0]*100, 2)
        print('The column contains', dataSet[columnName].isnull().sum(), 'missing values. This corresponds to', percentage, '%.')
    else:
        print('The column does not contain missing values.')
        
# compute the value range
def valueRange(columnName, dataSet) :
    print('Min. value: ', dataSet[columnName].min(), '\nMax. value: ', dataSet[columnName].max())

# compute the data types
def dataTypes(columnName, dataSet):
    print('Occuring data types: ', dataSet[columnName].apply(type).unique())
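
As a possible alternative, the three checks could be bundled into one function that returns its results instead of printing them, which makes them easier to reuse and to test. This is only a sketch: the name `summarizeColumn` and the toy data are assumptions and not part of the notebook.

```python
import pandas as pd

def summarizeColumn(columnName, dataSet):
    """Return missing-value count/percentage, value range and Python types of one column."""
    column = dataSet[columnName]
    numMissing = int(column.isnull().sum())
    return {
        'missing': numMissing,
        'missingPercent': round(numMissing / dataSet.shape[0] * 100, 2),
        # min/max skip missing values; they fail on columns mixing str and float
        'min': column.min(),
        'max': column.max(),
        'types': sorted(t.__name__ for t in column.apply(type).unique()),
    }

# toy data (hypothetical): one missing value out of four
toy = pd.DataFrame({'kWhDelivered': [1.5, None, 3.0, 0.5]})
summary = summarizeColumn('kWhDelivered', toy)
print(summary)
```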

### deleted Column: Unnamed: 0

In [4]:
dataTypes('Unnamed: 0', Charging_set)
valueRange('Unnamed: 0', Charging_set)
missingValues('Unnamed: 0', Charging_set)

Occuring data types:  [<class 'int'>]
Min. value:  0 
Max. value:  15291
The column does not contain missing values.


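This column appears to be a leftover row index from the export and carries no usable information: its values only range from 0 to 15291 although the data set has 66450 rows, so it does not even identify the rows uniquely. Therefore it can be deleted.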

In [5]:
Charging_set = Charging_set.drop(['Unnamed: 0'], axis=1)

### Delete duplicates

In [6]:
numRows = Charging_set.shape[0]
Charging_set = Charging_set.drop_duplicates()
print('Number of duplicate rows: ', numRows-Charging_set.shape[0])

Number of duplicate rows:  1413
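
Before dropping duplicates it can help to look at the affected rows. The following sketch uses hypothetical toy data and shows how `duplicated(keep=False)` lists every member of a duplicate group, while `drop_duplicates()` keeps the first occurrence and the original index labels.

```python
import pandas as pd

# toy data (hypothetical): the second and third rows are identical
toy = pd.DataFrame({
    'id': ['a', 'b', 'b', 'c'],
    'kWhDelivered': [1.0, 2.0, 2.0, 3.0],
})

# keep=False marks every member of a duplicate group, not only the repeats
duplicateGroups = toy[toy.duplicated(keep=False)]
print(duplicateGroups)

# drop_duplicates keeps the first occurrence and does not renumber the index
deduplicated = toy.drop_duplicates()
print('Number of duplicate rows: ', toy.shape[0] - deduplicated.shape[0])
```

Because the index is not renumbered, label-based lookups such as `.at[3, ...]` still refer to the original row labels; `reset_index(drop=True)` would renumber them if that is preferred.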


### Column: id - unique identifier of the session record

In [7]:
dataTypes('id', Charging_set)
valueRange('id', Charging_set)
missingValues('id', Charging_set)

Occuring data types:  [<class 'str'>]
Min. value:  5bc90cb9f9af8b0d7fe77cd2 
Max. value:  6155053bf9af8b76960e16d1
The column does not contain missing values.


### Column: connectionTime - time when the EV plugged in

In [8]:
dataTypes('connectionTime', Charging_set)
valueRange('connectionTime', Charging_set)
missingValues('connectionTime', Charging_set)

Occuring data types:  [<class 'pandas._libs.tslibs.timestamps.Timestamp'>]
Min. value:  2018-04-25 11:08:04+00:00 
Max. value:  2021-09-14 05:43:39+00:00
The column does not contain missing values.


### Column: disconnectTime - time when the EV unplugged

In [9]:
dataTypes('disconnectTime', Charging_set)
valueRange('disconnectTime', Charging_set)
missingValues('disconnectTime', Charging_set)

Occuring data types:  [<class 'pandas._libs.tslibs.timestamps.Timestamp'>]
Min. value:  2018-04-25 13:20:10+00:00 
Max. value:  2021-09-14 14:46:28+00:00
The column does not contain missing values.


Check for erroneous values:

In [10]:
print('In ', Charging_set[Charging_set['disconnectTime'] < Charging_set['connectionTime']].shape[0], 'rows the disconnectTime is earlier than the connectionTime.')

In  0 rows the disconnectTime is earlier than the connectionTime.


### new Column: totalConnectionTime - total time the EV was plugged in

In [11]:
Charging_set['totalConnectionTime'] = Charging_set['disconnectTime'] - Charging_set['connectionTime']
dataTypes('totalConnectionTime', Charging_set)
valueRange('totalConnectionTime', Charging_set)
missingValues('totalConnectionTime', Charging_set)

Occuring data types:  [<class 'pandas._libs.tslibs.timedeltas.Timedelta'>]
Min. value:  0 days 00:02:04 
Max. value:  10 days 05:16:09
The column does not contain missing values.
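
For later analysis a numeric duration is often easier to handle than a Timedelta. A minimal sketch of the conversion to hours, using the timestamps of one session shown in this notebook; the column name `totalConnectionHours` is an assumption.

```python
import pandas as pd

toy = pd.DataFrame({
    'connectionTime': pd.to_datetime(['2019-03-05 19:13:55+00:00']),
    'disconnectTime': pd.to_datetime(['2019-03-05 22:50:39+00:00']),
})
toy['totalConnectionTime'] = toy['disconnectTime'] - toy['connectionTime']

# Timedelta -> seconds -> hours as float (hypothetical helper column)
toy['totalConnectionHours'] = toy['totalConnectionTime'].dt.total_seconds() / 3600
print(toy[['totalConnectionTime', 'totalConnectionHours']])
```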


### Column: doneChargingTime - time of the last non-zero current draw recorded

In [12]:
dataTypes('doneChargingTime', Charging_set)
valueRange('doneChargingTime', Charging_set)
missingValues('doneChargingTime', Charging_set)
# Charging_set[Charging_set['kWhDelivered'] == 0] -> not because of no kWh delivered 
# Charging_set[Charging_set['userInputs'].isnull() & Charging_set['doneChargingTime'].isnull()] -> not because of the userInput

Occuring data types:  [<class 'pandas._libs.tslibs.timestamps.Timestamp'>
 <class 'pandas._libs.tslibs.nattype.NaTType'>]
Min. value:  2018-04-25 13:21:10+00:00 
Max. value:  2021-09-14 14:46:22+00:00
The column contains 4087 missing values. This corresponds to 6.28 %.


Check for erroneous values:

In [13]:
print('In these', Charging_set[Charging_set['doneChargingTime'] < Charging_set['connectionTime']].shape[0], 'rows the doneChargingTime is earlier than the connectionTime.')
Charging_set[Charging_set['doneChargingTime'] < Charging_set['connectionTime']]
# Charging_set[(Charging_set['doneChargingTime'] < Charging_set['connectionTime']) & Charging_set['userInputs'].isnull()] -> not because of the userInput

In these 27 rows the doneChargingTime is earlier than the connectionTime.


index,id,connectionTime,disconnectTime,doneChargingTime,kWhDelivered,sessionID,siteID,spaceID,stationID,timezone,userID,userInputs,totalConnectionTime
22219,5c942ca4f9af8b06b04b3bb4,2019-03-05 19:13:55+00:00,2019-03-05 22:50:39+00:00,2019-03-05 19:12:56+00:00,0.706655,2_39_78_367_2019-03-05 19:13:55.113078,2,CA-494,2-39-78-367,America/Los_Angeles,,,0 days 03:36:44
22253,5c957e1cf9af8b42f440af03,2019-03-06 20:26:30+00:00,2019-03-07 01:48:54+00:00,2019-03-06 20:25:34+00:00,1.046381,2_39_78_367_2019-03-06 20:26:30.479644,2,CA-494,2-39-78-367,America/Los_Angeles,,,0 days 05:22:24
23562,5cca3a22f9af8b49aaa4cba0,2019-04-15 20:24:13+00:00,2019-04-15 23:39:04+00:00,2019-04-15 20:23:14+00:00,0.635278,2_39_78_367_2019-04-15 20:24:13.365605,2,CA-494,2-39-78-367,America/Los_Angeles,1154.0,"[{'WhPerMile': 308, 'kWhRequested': 9.24, 'mil...",0 days 03:14:51
23586,5ccb8ba6f9af8b4d9721df00,2019-04-16 16:11:08+00:00,2019-04-16 19:10:48+00:00,2019-04-16 16:10:11+00:00,0.585977,2_39_78_367_2019-04-16 16:11:07.939710,2,CA-494,2-39-78-367,America/Los_Angeles,1154.0,"[{'WhPerMile': 308, 'kWhRequested': 6.16, 'mil...",0 days 02:59:40
27689,5d856f1ff9af8b0c7bdf245c,2019-09-04 16:35:04+00:00,2019-09-05 00:44:27+00:00,2019-09-04 16:34:05+00:00,1.5845,2_39_78_367_2019-09-04 16:35:04.129327,2,CA-494,2-39-78-367,America/Los_Angeles,,,0 days 08:09:23
27740,5d86c0a5f9af8b1022a81870,2019-09-05 18:44:57+00:00,2019-09-06 00:55:19+00:00,2019-09-05 18:43:57+00:00,1.06723,2_39_78_360_2019-09-05 18:44:57.410168,2,CA-322,2-39-78-360,America/Los_Angeles,,,0 days 06:10:22
29295,5dcdffbdf9af8b220a19be8b,2019-10-29 17:22:32+00:00,2019-10-31 01:57:20+00:00,2019-10-29 17:21:33+00:00,6.31621,2_39_78_367_2019-10-29 17:22:32.086306,2,CA-494,2-39-78-367,America/Los_Angeles,1470.0,"[{'WhPerMile': 292, 'kWhRequested': 14.6, 'mil...",1 days 08:34:48
31285,5bc91740f9af8b0dc677b860,2018-05-04 19:08:37+00:00,2018-05-04 22:07:47+00:00,2018-05-04 19:07:40+00:00,0.551722,2_39_78_363_2018-05-04 19:08:36.642114,2,CA-320,2-39-78-363,America/Los_Angeles,,,0 days 02:59:10
31287,5bc91740f9af8b0dc677b862,2018-05-04 19:23:52+00:00,2018-05-05 00:04:15+00:00,2018-05-04 19:22:52+00:00,0.912297,2_39_78_367_2018-05-04 19:23:51.897392,2,CA-494,2-39-78-367,America/Los_Angeles,,,0 days 04:40:23
31403,5bc917d0f9af8b0dc677b8d6,2018-05-07 20:47:51+00:00,2018-05-08 02:16:00+00:00,2018-05-07 20:47:50+00:00,14.967,2_39_139_567_2018-05-07 20:47:50.862655,2,CA-513,2-39-139-567,America/Los_Angeles,,,0 days 05:28:09


Since doneChargingTime cannot be earlier than connectionTime, these values are incorrect. We delete the corresponding rows because they make up only a very small part of the data set.

In [14]:
Charging_set = Charging_set[Charging_set['doneChargingTime'] >= Charging_set['connectionTime']]
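
One detail of this kind of filter is worth keeping in mind: comparisons with a missing timestamp (NaT) evaluate to False, so a condition of the form `doneChargingTime >= connectionTime` also removes rows whose doneChargingTime is missing. The sketch below, on hypothetical toy data, contrasts this with a filter that negates the error condition and therefore keeps rows with missing values.

```python
import pandas as pd

toy = pd.DataFrame({
    'connectionTime': pd.to_datetime(
        ['2019-03-05 19:13:55', '2019-03-05 20:00:00', '2019-03-05 21:00:00'], utc=True),
    'doneChargingTime': pd.to_datetime(
        ['2019-03-05 19:12:56', '2019-03-05 21:30:00', None], utc=True),
})

# comparisons with NaT are False, so this keeps only the second row
strict = toy[toy['doneChargingTime'] >= toy['connectionTime']]

# negating the error condition removes only the inconsistent first row
# and keeps the row whose doneChargingTime is missing
keepMissing = toy[~(toy['doneChargingTime'] < toy['connectionTime'])]

print(len(strict), len(keepMissing))
```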

In [15]:
print('In these', Charging_set[Charging_set['disconnectTime'] < Charging_set['doneChargingTime']].shape[0], 'rows the disconnectTime is earlier than the doneChargingTime.')
Charging_set[Charging_set['disconnectTime'] < Charging_set['doneChargingTime']]

In these 4387 rows the disconnectTime is earlier than the doneChargingTime.


index,id,connectionTime,disconnectTime,doneChargingTime,kWhDelivered,sessionID,siteID,spaceID,stationID,timezone,userID,userInputs,totalConnectionTime
12,5e23b149f9af8b5fe4b973db,2020-01-02 15:04:38+00:00,2020-01-02 22:08:39+00:00,2020-01-02 22:09:36+00:00,25.567,1_1_178_824_2020-01-02 15:04:38.051735,1,AG-1F07,1-1-178-824,America/Los_Angeles,528.0,"[{'WhPerMile': 250, 'kWhRequested': 50.0, 'mil...",0 days 07:04:01
20,5e23b149f9af8b5fe4b973e3,2020-01-02 15:28:47+00:00,2020-01-02 19:01:54+00:00,2020-01-02 19:02:51+00:00,7.417,1_1_193_827_2020-01-02 15:28:46.685366,1,AG-1F02,1-1-193-827,America/Los_Angeles,1283.0,"[{'WhPerMile': 350, 'kWhRequested': 42.0, 'mil...",0 days 03:33:07
25,5e23b149f9af8b5fe4b973e8,2020-01-02 15:42:05+00:00,2020-01-02 21:58:45+00:00,2020-01-02 21:59:42+00:00,36.701,1_1_179_797_2020-01-02 15:42:05.217965,1,AG-3F23,1-1-179-797,America/Los_Angeles,474.0,"[{'WhPerMile': 400, 'kWhRequested': 32.0, 'mil...",0 days 06:16:40
26,5e23b149f9af8b5fe4b973e9,2020-01-02 15:57:24+00:00,2020-01-02 16:35:37+00:00,2020-01-02 16:36:34+00:00,3.689,1_1_179_781_2020-01-02 15:57:23.951170,1,AG-3F31,1-1-179-781,America/Los_Angeles,724.0,"[{'WhPerMile': 400, 'kWhRequested': 8.0, 'mile...",0 days 00:38:13
33,5e23b149f9af8b5fe4b973f0,2020-01-02 16:34:35+00:00,2020-01-02 18:49:41+00:00,2020-01-02 18:50:38+00:00,7.120,1_1_179_790_2020-01-02 16:34:34.999200,1,AG-3F19,1-1-179-790,America/Los_Angeles,2276.0,"[{'WhPerMile': 600, 'kWhRequested': 18.0, 'mil...",0 days 02:15:06
...,...,...,...,...,...,...,...,...,...,...,...,...,...
65028,5d2fbdd3f9af8b4d0dd0d546,2019-07-01 19:20:31+00:00,2019-07-02 00:16:32+00:00,2019-07-02 00:16:42+00:00,26.324,1_1_179_783_2019-07-01 19:20:30.955300,1,AG-3F29,1-1-179-783,America/Los_Angeles,458.0,"[{'WhPerMile': 400, 'kWhRequested': 8.0, 'mile...",0 days 04:56:01
65030,5d2fbdd3f9af8b4d0dd0d548,2019-07-01 20:01:07+00:00,2019-07-02 00:32:26+00:00,2019-07-02 00:32:59+00:00,21.588,1_1_179_800_2019-07-01 20:01:06.782562,1,AG-3F32,1-1-179-800,America/Los_Angeles,1479.0,"[{'WhPerMile': 275, 'kWhRequested': 19.25, 'mi...",0 days 04:31:19
65033,5d2fbdd3f9af8b4d0dd0d54b,2019-07-01 21:58:45+00:00,2019-07-02 00:39:48+00:00,2019-07-02 00:40:21+00:00,16.864,1_1_179_794_2019-07-01 21:58:44.571011,1,AG-3F20,1-1-179-794,America/Los_Angeles,364.0,"[{'WhPerMile': 400, 'kWhRequested': 40.0, 'mil...",0 days 02:41:03
65034,5d2fbdd3f9af8b4d0dd0d54c,2019-07-01 22:02:21+00:00,2019-07-02 00:58:50+00:00,2019-07-02 00:59:23+00:00,18.335,1_1_191_807_2019-07-01 22:02:20.810735,1,AG-4F47,1-1-191-807,America/Los_Angeles,2050.0,"[{'WhPerMile': 333, 'kWhRequested': 29.97, 'mi...",0 days 02:56:29


In [16]:
Charging_set['timeDifference'] = Charging_set['doneChargingTime'] - Charging_set['disconnectTime']

print("Difference between disconnectTime and doneChargingTime when disconnectTime is earlier than doneChargingTime: \n",
      "min: ", Charging_set.loc[Charging_set['timeDifference'] > pd.Timedelta(minutes=0), 'timeDifference'].min(),
      "\n max: ", Charging_set.loc[Charging_set['timeDifference'] > pd.Timedelta(minutes=0), 'timeDifference'].max(),
      "\n mean: ", Charging_set.loc[Charging_set['timeDifference'] > pd.Timedelta(minutes=0), 'timeDifference'].mean())

print('Number of rows with a difference between disconnectTime and doneChargingTime of at least two minutes: ', 
      Charging_set.loc[Charging_set['timeDifference'] >= pd.Timedelta(minutes=2), 'timeDifference'].count())

Difference between disconnectTime and doneChargingTime when disconnectTime is earlier than doneChargingTime: 
 min:  0 days 00:00:01 
 max:  0 days 00:59:56 
 mean:  0 days 00:00:56.017095965
Number of rows with a difference between disconnectTime and doneChargingTime of at least two minutes:  2


A doneChargingTime slightly after the disconnectTime can occur because current may still be recorded shortly after the EV is unplugged. No charging takes place at that point, so in these cases we set the doneChargingTime equal to the disconnectTime.

There are 2 outliers where the doneChargingTime is at least 2 minutes later than the disconnectTime (the largest difference is 59 minutes and 56 seconds). We delete these rows because there is no plausible explanation for such a large difference, so the values are treated as incorrect.

In [17]:
# delete outliers
Charging_set = Charging_set.loc[Charging_set['timeDifference'] < pd.Timedelta(minutes=2)]

# doneChargingTime = disconnectTime
Charging_set.loc[Charging_set['timeDifference'] > pd.Timedelta(minutes=0), 'doneChargingTime'] = Charging_set.loc[Charging_set['timeDifference'] > pd.Timedelta(minutes=0), 'disconnectTime']

Charging_set.drop(columns=['timeDifference'], inplace=True)
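
The same clamping can be written without a temporary column. A minimal sketch on hypothetical toy data, using `Series.mask` to replace doneChargingTime by disconnectTime wherever it is later; rows with a missing doneChargingTime are left untouched because the comparison is False for NaT.

```python
import pandas as pd

toy = pd.DataFrame({
    'disconnectTime': pd.to_datetime(['2020-01-02 22:08:39', '2020-01-02 19:00:00'], utc=True),
    'doneChargingTime': pd.to_datetime(['2020-01-02 22:09:36', '2020-01-02 18:30:00'], utc=True),
})

# True where the recorded end of charging lies after the unplug time
tooLate = toy['doneChargingTime'] > toy['disconnectTime']

# replace only those entries by the disconnectTime
toy['doneChargingTime'] = toy['doneChargingTime'].mask(tooLate, toy['disconnectTime'])
print(toy)
```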

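We estimate missing values of doneChargingTime as connectionTime plus the average charging time per kWh multiplied by the kWh delivered. Note that the filters in In [14] and In [17] keep only rows for which the comparison is True, and comparisons with NaT are False, so the 4087 rows with a missing doneChargingTime have already been removed at this point (65037 - 27 - 2 - 4087 = 60921 rows, which matches the final row count). The fill therefore leaves the data unchanged here.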

In [18]:
TotalChargingTime = Charging_set['doneChargingTime'] - Charging_set['connectionTime']
AverageTimePerKwh = (TotalChargingTime/Charging_set['kWhDelivered']).mean()
Charging_set['doneChargingTime'] = Charging_set['doneChargingTime'].fillna(value=Charging_set['connectionTime']+AverageTimePerKwh*Charging_set['kWhDelivered']).dt.round('1s')
dataTypes('doneChargingTime', Charging_set)
valueRange('doneChargingTime', Charging_set)
missingValues('doneChargingTime', Charging_set)

Occuring data types:  [<class 'pandas._libs.tslibs.timestamps.Timestamp'>]
Min. value:  2018-04-25 13:20:10+00:00 
Max. value:  2021-09-14 14:46:22+00:00
The column does not contain missing values.
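
To make the estimation rule concrete, here is a small worked sketch on hypothetical toy data in which one doneChargingTime is missing. Both complete sessions need 6 minutes per kWh, so the missing value for a 5 kWh session is estimated as 30 minutes after its connectionTime.

```python
import pandas as pd

toy = pd.DataFrame({
    'connectionTime': pd.to_datetime(
        ['2020-01-01 10:00:00', '2020-01-01 12:00:00', '2020-01-01 14:00:00'], utc=True),
    'doneChargingTime': pd.to_datetime(
        ['2020-01-01 11:00:00', '2020-01-01 14:00:00', None], utc=True),
    'kWhDelivered': [10.0, 20.0, 5.0],
})

# charging time per kWh for the complete rows; mean() skips the missing row
chargingTime = toy['doneChargingTime'] - toy['connectionTime']
averageTimePerKwh = (chargingTime / toy['kWhDelivered']).mean()

# estimate = connectionTime + kWh delivered * average time per kWh
estimate = toy['connectionTime'] + toy['kWhDelivered'] * averageTimePerKwh
toy['doneChargingTime'] = toy['doneChargingTime'].fillna(estimate).dt.round('1s')
print(toy)
```

An estimate obtained this way can lie after the disconnectTime of the session, which is one reason to re-check the ordering of the time columns afterwards.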


Because doneChargingTime was modified, we check the ordering constraints between the three time columns again.

In [19]:
print('In ', Charging_set[Charging_set['disconnectTime'] < Charging_set['connectionTime']].shape[0], 'rows the disconnectTime is earlier than the connectionTime.')
print('In ', Charging_set[Charging_set['doneChargingTime'] < Charging_set['connectionTime']].shape[0], 'rows the doneChargingTime is earlier than the connectionTime.')
print('In ', Charging_set[Charging_set['disconnectTime'] < Charging_set['doneChargingTime']].shape[0], 'rows the disconnectTime is earlier than the doneChargingTime.')

In  0 rows the disconnectTime is earlier than the connectionTime.
In  0 rows the doneChargingTime is earlier than the connectionTime.
In  0 rows the disconnectTime is earlier than the doneChargingTime.
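
Since the same three checks are run more than once, they could be wrapped in a small helper. This is only a sketch; the function name `countOrderViolations` and the toy data are assumptions.

```python
import pandas as pd

def countOrderViolations(dataSet):
    """Count rows that violate connectionTime <= doneChargingTime <= disconnectTime."""
    return {
        'disconnectBeforeConnection': int((dataSet['disconnectTime'] < dataSet['connectionTime']).sum()),
        'doneBeforeConnection': int((dataSet['doneChargingTime'] < dataSet['connectionTime']).sum()),
        'disconnectBeforeDone': int((dataSet['disconnectTime'] < dataSet['doneChargingTime']).sum()),
    }

# toy data (hypothetical): the second row ends charging before it was connected
toy = pd.DataFrame({
    'connectionTime': pd.to_datetime(['2020-01-02 15:00:00', '2020-01-02 16:00:00'], utc=True),
    'doneChargingTime': pd.to_datetime(['2020-01-02 17:00:00', '2020-01-02 15:59:00'], utc=True),
    'disconnectTime': pd.to_datetime(['2020-01-02 18:00:00', '2020-01-02 19:00:00'], utc=True),
})
violations = countOrderViolations(toy)
print(violations)
```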


### new Column: totalChargingTime - total time the EV was charging

In [20]:
Charging_set['totalChargingTime'] = Charging_set['doneChargingTime'] - Charging_set['connectionTime']
dataTypes('totalChargingTime', Charging_set)
valueRange('totalChargingTime', Charging_set)
missingValues('totalChargingTime', Charging_set)

Occuring data types:  [<class 'pandas._libs.tslibs.timedeltas.Timedelta'>]
Min. value:  0 days 00:00:00 
Max. value:  8 days 08:00:57
The column does not contain missing values.


### Column: kWhDelivered - amount of energy delivered during the session

In [21]:
dataTypes('kWhDelivered', Charging_set)
valueRange('kWhDelivered', Charging_set)
missingValues('kWhDelivered', Charging_set)

Occuring data types:  [<class 'float'>]
Min. value:  0.501 
Max. value:  108.79724166666666
The column does not contain missing values.


### Column: sessionID - unique identifier for the session

In [22]:
dataTypes('sessionID', Charging_set)
valueRange('sessionID', Charging_set)
missingValues('sessionID', Charging_set)

Occuring data types:  [<class 'str'>]
Min. value:  1_1_178_817_2018-10-08 20:54:24.215877 
Max. value:  2_39_95_444_2021-07-14 16:54:32.800222
The column does not contain missing values.


Open question: how does this column differ from the column id?

### Column: siteID - unique identifier for the site

In [23]:
dataTypes('siteID', Charging_set)
valueRange('siteID', Charging_set)
missingValues('siteID', Charging_set)

Occuring data types:  [<class 'int'>]
Min. value:  1 
Max. value:  2
The column does not contain missing values.


### Column: spaceID - unique identifier for the parking space

In [24]:
dataTypes('spaceID', Charging_set)
valueRange('spaceID', Charging_set)
missingValues('spaceID', Charging_set)

Occuring data types:  [<class 'str'>]
Min. value:  AG-1F01 
Max. value:  CA-513
The column does not contain missing values.


The spaceIDs follow different naming schemes (CA..., AG..., numbers).

### Column: stationID - unique identifier for the EVSE

In [25]:
dataTypes('stationID', Charging_set)
valueRange('stationID', Charging_set)
missingValues('stationID', Charging_set)

Occuring data types:  [<class 'str'>]
Min. value:  1-1-178-817 
Max. value:  2-39-95-444
The column does not contain missing values.


### deleted Column: timezone - timezone of the site, based on pytz format

In [26]:
dataTypes('timezone', Charging_set)
valueRange('timezone', Charging_set)
missingValues('timezone', Charging_set)
print( 'Occuring values: ', Charging_set['timezone'].unique())

Occuring data types:  [<class 'str'>]
Min. value:  America/Los_Angeles 
Max. value:  America/Los_Angeles
The column does not contain missing values.
Occuring values:  ['America/Los_Angeles']


The column 'timezone' only contains identical values. Therefore, it can be deleted.

In [27]:
Charging_set = Charging_set.drop(['timezone'], axis=1)

### new Column: userInformation - indication whether user information is given

In [28]:
Charging_set['userInformation'] = Charging_set['userID'].notnull() | Charging_set['userInputs'].notnull()

The new column indicates whether userID or userInputs is available. The sessions with user information can be selected with Charging_set[Charging_set['userInformation']].

### Column: userID - unique identifier of the user, if provided

In [29]:
dataTypes('userID', Charging_set)
valueRange('userID', Charging_set)
missingValues('userID', Charging_set)
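# isnull() is an alias of isna(), so comparing their counts would always be True.
# Instead, check that every missing entry is a float NaN rather than None.
if Charging_set.loc[Charging_set['userID'].isnull(), 'userID'].apply(lambda v: isinstance(v, float)).all():
    print('All missing values are NaN values.')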

Occuring data types:  [<class 'float'>]
Min. value:  1.0 
Max. value:  19923.0
The column contains 16285 missing values. This corresponds to 26.73 %.
All missing values are NaN values.


### Column: userInputs - inputs provided by the user

In [30]:
dataTypes('userInputs', Charging_set)
missingValues('userInputs', Charging_set)
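# isnull() is an alias of isna(), so comparing their counts would always be True.
# Instead, check that every missing entry is a float NaN rather than None.
if Charging_set.loc[Charging_set['userInputs'].isnull(), 'userInputs'].apply(lambda v: isinstance(v, float)).all():
    print('All missing values are NaN values.')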
print('Example of a string containing the user input:', Charging_set.at[3, 'userInputs'])

Occuring data types:  [<class 'str'> <class 'float'>]
The column contains 16285 missing values. This corresponds to 26.73 %.
All missing values are NaN values.
Example of a string containing the user input: [{'WhPerMile': 400, 'kWhRequested': 8.0, 'milesRequested': 20, 'minutesAvailable': 65, 'modifiedAt': 'Thu, 02 Jan 2020 14:00:03 GMT', 'paymentRequired': True, 'requestedDeparture': 'Thu, 02 Jan 2020 15:04:58 GMT', 'userID': 1117}, {'WhPerMile': 400, 'kWhRequested': 8.0, 'milesRequested': 20, 'minutesAvailable': 65, 'modifiedAt': 'Thu, 02 Jan 2020 14:00:19 GMT', 'paymentRequired': True, 'requestedDeparture': 'Thu, 02 Jan 2020 15:04:58 GMT', 'userID': 1117}]
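
The userInputs values are stored as strings that look like Python literals (a list of dicts). A minimal sketch of how such a string could be parsed with `ast.literal_eval`, using a shortened version of the example above. The helper name `parseUserInputs` is hypothetical, and treating the last list entry as the most recent input is an assumption based on the increasing modifiedAt values in the example.

```python
import ast
import pandas as pd

raw = ("[{'WhPerMile': 400, 'kWhRequested': 8.0, 'milesRequested': 20, "
       "'minutesAvailable': 65, 'modifiedAt': 'Thu, 02 Jan 2020 14:00:03 GMT', "
       "'paymentRequired': True, 'requestedDeparture': 'Thu, 02 Jan 2020 15:04:58 GMT', "
       "'userID': 1117}]")

def parseUserInputs(value):
    """Turn the string representation into a list of dicts; missing values give an empty list."""
    if pd.isna(value):
        return []
    return ast.literal_eval(value)

entries = parseUserInputs(raw)
latest = entries[-1]  # assumption: the last entry is the most recent one

# the date strings end with a literal 'GMT', which is interpreted as UTC here
requestedDeparture = pd.to_datetime(
    latest['requestedDeparture'], format='%a, %d %b %Y %H:%M:%S GMT', utc=True)
print(latest['kWhRequested'], requestedDeparture)
```

`ast.literal_eval` only evaluates literals, so it is a safer choice than `eval` for data read from a file.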


### Dataset description

In [31]:
Charging_set.info()
Charging_set
Charging_set.to_csv('charging_sessions_preprocessed.csv', index=False)

<class 'pandas.core.frame.DataFrame'>
Index: 60921 entries, 0 to 65036
Data columns (total 14 columns):
 #   Column               Non-Null Count  Dtype              
---  ------               --------------  -----              
 0   id                   60921 non-null  object             
 1   connectionTime       60921 non-null  datetime64[ns, UTC]
 2   disconnectTime       60921 non-null  datetime64[ns, UTC]
 3   doneChargingTime     60921 non-null  datetime64[ns, UTC]
 4   kWhDelivered         60921 non-null  float64            
 5   sessionID            60921 non-null  object             
 6   siteID               60921 non-null  int64              
 7   spaceID              60921 non-null  object             
 8   stationID            60921 non-null  object             
 9   userID               44636 non-null  float64            
 10  userInputs           44636 non-null  object             
 11  totalConnectionTime  60921 non-null  timedelta64[ns]    
 12  totalChargingTime    60
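
A CSV file does not store dtypes, so the datetime and timedelta columns have to be converted again when charging_sessions_preprocessed.csv is read back. The sketch below illustrates this round trip on hypothetical toy data with an in-memory buffer instead of a file.

```python
import io
import pandas as pd

toy = pd.DataFrame({
    'connectionTime': pd.to_datetime(['2019-03-05 19:13:55'], utc=True),
    'disconnectTime': pd.to_datetime(['2019-03-05 22:50:39'], utc=True),
})
toy['totalConnectionTime'] = toy['disconnectTime'] - toy['connectionTime']

# write to an in-memory buffer instead of a file
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# datetimes are restored via parse_dates, timedeltas via pd.to_timedelta
restored = pd.read_csv(buffer, parse_dates=['connectionTime', 'disconnectTime'])
restored['totalConnectionTime'] = pd.to_timedelta(restored['totalConnectionTime'])
print(restored.dtypes)
```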

## Weather Burbank Airport Dataset

Read in the data as a dataframe. Interpret the date columns as dates.

In [32]:
Weather_set = pd.read_csv('weather_burbank_airport.csv', parse_dates=['timestamp'])
print('Columns: ', Weather_set.columns)
print('Number of rows: ', Weather_set.shape[0])

Columns:  Index(['city', 'timestamp', 'temperature', 'cloud_cover',
       'cloud_cover_description', 'pressure', 'windspeed', 'precipitation',
       'felt_temperature'],
      dtype='object')
Number of rows:  29244


### Delete duplicates

In [33]:
numRows = Weather_set.shape[0]
Weather_set = Weather_set.drop_duplicates()
print('Number of duplicate rows: ', numRows-Weather_set.shape[0])

Number of duplicate rows:  0


### deleted Column: city

In [34]:
dataTypes('city', Weather_set)
valueRange('city', Weather_set)
missingValues('city', Weather_set)
print( 'Occuring values: ', Weather_set['city'].unique())

Occuring data types:  [<class 'str'>]
Min. value:  Burbank 
Max. value:  Burbank
The column does not contain missing values.
Occuring values:  ['Burbank']


The column 'city' only contains identical values. Therefore, it can be deleted.

In [35]:
Weather_set = Weather_set.drop(['city'], axis=1)

### Column: timestamp

In [36]:
dataTypes('timestamp', Weather_set)
valueRange('timestamp', Weather_set)
missingValues('timestamp', Weather_set)
if Weather_set['timestamp'].is_monotonic_increasing :
    print('The rows are ordered by increasing time.')
print('Occuring time intervals: ', Weather_set['timestamp'].diff().unique())
print('Number of rows in the Charging Sessions Dataset with values after 2021-01-01 07:53:00: ', len(Charging_set[Charging_set['connectionTime'] > '2021-01-01 07:53:00']))

Occuring data types:  [<class 'pandas._libs.tslibs.timestamps.Timestamp'>]
Min. value:  2018-01-01 08:53:00 
Max. value:  2021-01-01 07:53:00
The column does not contain missing values.
The rows are ordered by increasing time.
Occuring time intervals:  <TimedeltaArray>
[              NaT, '0 days 01:00:00', '0 days 00:43:00', '0 days 00:05:00',
 '0 days 00:12:00', '0 days 00:37:00', '0 days 00:23:00', '0 days 00:31:00',
 '0 days 00:17:00', '0 days 00:04:00', '0 days 00:11:00', '0 days 00:28:00',
 '0 days 00:41:00', '0 days 00:19:00', '0 days 00:39:00', '0 days 00:21:00',
 '0 days 00:38:00', '0 days 00:20:00', '0 days 00:02:00', '0 days 00:08:00',
 '0 days 00:15:00', '0 days 00:24:00', '0 days 00:14:00', '0 days 00:46:00',
 '0 days 00:36:00', '0 days 00:10:00', '0 days 00:09:00', '0 days 00:51:00',
 '0 days 00:48:00', '0 days 00:03:00', '0 days 00:30:00', '0 days 00:16:00',
 '0 days 00:42:00', '0 days 00:49:00', '0 days 00:29:00', '0 days 00:32:00',
 '0 days 00:44:00', '0 days 00:07:00'

The Charging Session Dataset contains data between 2018-04-25 11:08:04+00:00 and 2021-09-14 14:46:28+00:00.

The Weather Burbank Airport Dataset contains data between 2018-01-01 08:53:00 and 2021-01-01 07:53:00.

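Problem: the time spans do not match. The weather data ends at 2021-01-01 07:53:00, so charging sessions that start after this point have no corresponding weather observations.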
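
One possible way to handle this when combining the two datasets is an as-of join with a tolerance, so that sessions outside the weather time span receive missing weather values instead of a stale observation. The sketch below uses hypothetical toy data. It assumes that the timezone-naive weather timestamps are in UTC and that a tolerance of one hour is acceptable; neither is confirmed by the data description.

```python
import pandas as pd

sessions = pd.DataFrame({
    'connectionTime': pd.to_datetime(['2020-12-31 20:10:00', '2021-02-01 09:00:00'], utc=True),
    'kWhDelivered': [7.4, 12.1],
})
weather = pd.DataFrame({
    'timestamp': pd.to_datetime(
        ['2020-12-31 19:53:00', '2020-12-31 20:53:00', '2021-01-01 07:53:00']),
    'temperature': [15.0, 14.0, 9.0],
})

# ASSUMPTION: the naive weather timestamps are UTC
weather['timestamp'] = weather['timestamp'].dt.tz_localize('UTC')

# for every session take the latest weather row at or before connectionTime,
# but only if it is at most one hour old
merged = pd.merge_asof(
    sessions.sort_values('connectionTime'),
    weather.sort_values('timestamp'),
    left_on='connectionTime',
    right_on='timestamp',
    direction='backward',
    tolerance=pd.Timedelta(hours=1),
)
print(merged)
```

The second toy session starts after the last weather observation and therefore keeps a missing temperature, which makes the uncovered sessions easy to filter out later.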

### Column: temperature

In [37]:
dataTypes('temperature', Weather_set)
valueRange('temperature', Weather_set)
missingValues('temperature', Weather_set)

Occuring data types:  [<class 'float'>]
Min. value:  2.0 
Max. value:  46.0
The column contains 25 missing values. This corresponds to 0.09 %.


To avoid losing rows with missing values, we replace them. Since the data is sorted by time, the true values are likely close to the neighbouring values, so we forward-fill the missing values.

In [38]:
Weather_set['temperature'] = Weather_set['temperature'].ffill()
missingValues('temperature', Weather_set)

The column does not contain missing values.
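
A short sketch of how forward fill behaves, on hypothetical toy values: each gap is filled with the last preceding observation, a missing value at the very start stays missing, and the optional `limit` parameter restricts how many consecutive values are filled.

```python
import pandas as pd

temperature = pd.Series([None, 12.0, None, None, 15.0])

# every gap takes the last preceding value; the leading gap has none
filled = temperature.ffill()

# fill at most one consecutive missing value per gap
limited = temperature.ffill(limit=1)

print(filled.tolist())
print(limited.tolist())
```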


### Column: cloud_cover

In [39]:
dataTypes('cloud_cover', Weather_set)
valueRange('cloud_cover', Weather_set)
missingValues('cloud_cover', Weather_set)

Occuring data types:  [<class 'float'>]
Min. value:  4.0 
Max. value:  47.0
The column contains 20 missing values. This corresponds to 0.07 %.


To avoid losing rows with missing values, we replace them. Since the data is sorted by time, the true values are likely close to the neighbouring values, so we forward-fill the missing values.

In [40]:
Weather_set['cloud_cover'] = Weather_set['cloud_cover'].ffill()
missingValues('cloud_cover', Weather_set)

The column does not contain missing values.


### Column: cloud_cover_description

In [41]:
dataTypes('cloud_cover_description', Weather_set)
missingValues('cloud_cover_description', Weather_set)

Occuring data types:  [<class 'str'> <class 'float'>]
The column contains 20 missing values. This corresponds to 0.07 %.


To avoid losing rows with missing values, we replace them. Since the data is sorted by time, the true values are likely close to the neighbouring values, so we forward-fill the missing values.

In [42]:
Weather_set['cloud_cover_description'] = Weather_set['cloud_cover_description'].ffill()
missingValues('cloud_cover_description', Weather_set)

The column does not contain missing values.


In [43]:
Weather_set["cloud_cover_description"].value_counts()

cloud_cover_description
Fair                       17140
Cloudy                      4937
Partly Cloudy               2668
Mostly Cloudy               1831
Light Rain                   896
Haze                         579
Smoke                        329
Fog                          325
Rain                         247
Heavy Rain                   120
Fair / Windy                  74
T-Storm                       18
Thunder in the Vicinity       17
Partly Cloudy / Windy         14
Light Rain / Windy            10
Mostly Cloudy / Windy         10
Cloudy / Windy                 9
Heavy Rain / Windy             7
Blowing Dust                   5
Heavy T-Storm                  4
Rain / Windy                   2
Thunder                        1
Light Rain with Thunder        1
Name: count, dtype: int64

Many of the 23 classes are rare (12 of them occur fewer than 20 times), so it could be useful to group them into a few broader classes.
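
One way such a grouping could look is sketched below. The group names and the assignment rules are purely illustrative assumptions, not a grouping used in this notebook: the ' / Windy' suffix is dropped, and the remaining descriptions are mapped to a handful of broader classes.

```python
import pandas as pd

def groupDescription(description):
    """Map a detailed description to a broader class (hypothetical grouping)."""
    base = description.replace(' / Windy', '')
    if 'Rain' in base or 'T-Storm' in base or 'Thunder' in base:
        return 'Rain or thunder'
    if base in ('Haze', 'Smoke', 'Fog', 'Blowing Dust'):
        return 'Reduced visibility'
    if 'Cloudy' in base:
        return 'Cloudy'
    return base

descriptions = pd.Series(
    ['Fair', 'Fair / Windy', 'Light Rain', 'Heavy T-Storm', 'Smoke', 'Mostly Cloudy / Windy'])
grouped = descriptions.apply(groupDescription)
print(grouped.value_counts())
```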

### Column: pressure

In [44]:
dataTypes('pressure', Weather_set)
valueRange('pressure', Weather_set)
missingValues('pressure', Weather_set)

Occuring data types:  [<class 'float'>]
Min. value:  971.0 
Max. value:  999.65
The column contains 8 missing values. This corresponds to 0.03 %.


To avoid losing rows with missing values, we replace them. Since the data is sorted by time, the true values are likely close to the neighbouring values, so we forward-fill the missing values.

In [45]:
Weather_set['pressure'] = Weather_set['pressure'].ffill()
missingValues('pressure', Weather_set)

The column does not contain missing values.


### Column: windspeed

In [46]:
dataTypes('windspeed', Weather_set)
valueRange('windspeed', Weather_set)
missingValues('windspeed', Weather_set)

Occuring data types:  [<class 'float'>]
Min. value:  0.0 
Max. value:  57.0
The column contains 86 missing values. This corresponds to 0.29 %.


To avoid losing rows with missing values, we replace them. Since the data is sorted by time, the true values are likely close to the neighbouring values, so we forward-fill the missing values.

In [47]:
Weather_set['windspeed'] = Weather_set['windspeed'].ffill()
missingValues('windspeed', Weather_set)

The column does not contain missing values.


### Column: precipitation

In [48]:
dataTypes('precipitation', Weather_set)
valueRange('precipitation', Weather_set)
missingValues('precipitation', Weather_set)

Occuring data types:  [<class 'float'>]
Min. value:  0.0 
Max. value:  18.54
The column does not contain missing values.


### Column: felt_temperature

In [49]:
dataTypes('felt_temperature', Weather_set)
valueRange('felt_temperature', Weather_set)
missingValues('felt_temperature', Weather_set)

Occuring data types:  [<class 'float'>]
Min. value:  0.0 
Max. value:  42.0
The column contains 26 missing values. This corresponds to 0.09 %.


To avoid losing rows with missing values, we replace them. Since the data is sorted by time, the true values are likely close to the neighbouring values, so we forward-fill the missing values.

In [50]:
Weather_set['felt_temperature'] = Weather_set['felt_temperature'].ffill()
missingValues('felt_temperature', Weather_set)

The column does not contain missing values.


### Dataset description

In [51]:
Weather_set.info()
Weather_set
Weather_set.to_csv('weather_burbank_airport_preprocessed.csv', index=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29244 entries, 0 to 29243
Data columns (total 8 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   timestamp                29244 non-null  datetime64[ns]
 1   temperature              29244 non-null  float64       
 2   cloud_cover              29244 non-null  float64       
 3   cloud_cover_description  29244 non-null  object        
 4   pressure                 29244 non-null  float64       
 5   windspeed                29244 non-null  float64       
 6   precipitation            29244 non-null  float64       
 7   felt_temperature         29244 non-null  float64       
dtypes: datetime64[ns](1), float64(6), object(1)
memory usage: 1.8+ MB
