# Task 1: Data Collection and Preparation

In [1]:
import pandas as pd

## Charging Sessions Dataset

Read the data into a DataFrame and parse the date columns as datetimes.

In [2]:
Charging_set = pd.read_csv('charging_sessions.csv', parse_dates=['connectionTime','disconnectTime','doneChargingTime'])
print('Columns: ', Charging_set.columns)
print('Number of rows: ', Charging_set.shape[0])

Columns:  Index(['Unnamed: 0', 'id', 'connectionTime', 'disconnectTime',
       'doneChargingTime', 'kWhDelivered', 'sessionID', 'siteID', 'spaceID',
       'stationID', 'timezone', 'userID', 'userInputs'],
      dtype='object')
Number of rows:  66450


Define functions to analyze the columns.

In [3]:
def missingValues(columnName, dataSet):
    # Report how many values are missing, relative to the size of the dataset passed in.
    numMissing = dataSet[columnName].isnull().sum()
    if numMissing > 0:
        percentage = round(numMissing / dataSet.shape[0] * 100, 2)
        print('The column contains', numMissing, 'missing values. This corresponds to', percentage, '%.')
    else:
        print('The column does not contain missing values.')

def valueRange(columnName, dataSet):
    # Print the smallest and largest value of the column.
    print('Min. value: ', dataSet[columnName].min(), '\nMax. value: ', dataSet[columnName].max())

def dataTypes(columnName, dataSet):
    # Print the Python types that occur among the column's values.
    print('Occurring data types: ', dataSet[columnName].apply(type).unique())
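
As a complement to the per-column helpers, the same checks can be collected in one table for a whole DataFrame. The following is a minimal sketch on invented toy data; the function name `summarize_columns` is hypothetical and not part of the notebook. It divides by the length of the DataFrame that is passed in, so it works for any dataset.

```python
import numpy as np
import pandas as pd

def summarize_columns(data_set: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing count, missing share in percent."""
    missing = data_set.isnull().sum()
    return pd.DataFrame({
        'dtype': data_set.dtypes.astype(str),
        'missing': missing,
        # Share of missing values relative to the frame that was passed in.
        'missing_pct': (missing / len(data_set) * 100).round(2),
    })

# Toy data, invented for illustration only.
toy = pd.DataFrame({'a': [1.0, np.nan, 3.0, 4.0], 'b': ['x', 'y', 'z', 'w']})
summary = summarize_columns(toy)
print(summary)
```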

### Column: number

Rename the unnamed column.

In [4]:
Charging_set.rename(columns={'Unnamed: 0': 'number'}, inplace=True)
dataTypes('number', Charging_set)
valueRange('number', Charging_set)
missingValues('number', Charging_set)

Occurring data types:  [<class 'int'>]
Min. value:  0 
Max. value:  15291
The column does not contain missing values.


DO WE NEED THIS COLUMN?
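
The output above shows a maximum of 15291 although the dataset has 66450 rows, so `number` cannot identify rows uniquely. One way to inform the decision is to check whether the counter repeats and how often it restarts. This is a small sketch on invented toy data, not a result from the real dataset.

```python
import pandas as pd

# Toy data, invented for illustration: the counter restarts after three rows.
toy = pd.DataFrame({'number': [0, 1, 2, 0, 1], 'id': ['a', 'b', 'c', 'd', 'e']})

# True only if every value occurs exactly once.
is_unique = toy['number'].is_unique

# A negative step between consecutive rows marks a restart of the counter.
restarts = int((toy['number'].diff() < 0).sum())

print('number is unique:', is_unique)
print('number of restarts:', restarts)
```

If the counter turns out to be a per-file running index, the `id` column may be the better row identifier, but that remains to be checked on the real data.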

### Column: id

In [5]:
dataTypes('id', Charging_set)
valueRange('id', Charging_set)
missingValues('id', Charging_set)

Occurring data types:  [<class 'str'>]
Min. value:  5bc90cb9f9af8b0d7fe77cd2 
Max. value:  6155053bf9af8b76960e16d1
The column does not contain missing values.


### Column: connectionTime

In [6]:
dataTypes('connectionTime', Charging_set)
valueRange('connectionTime', Charging_set)
missingValues('connectionTime', Charging_set)

Occurring data types:  [<class 'pandas._libs.tslibs.timestamps.Timestamp'>]
Min. value:  2018-04-25 11:08:04+00:00 
Max. value:  2021-09-14 05:43:39+00:00
The column does not contain missing values.


### Column: disconnectTime

In [7]:
dataTypes('disconnectTime', Charging_set)
valueRange('disconnectTime', Charging_set)
missingValues('disconnectTime', Charging_set)

Occurring data types:  [<class 'pandas._libs.tslibs.timestamps.Timestamp'>]
Min. value:  2018-04-25 13:20:10+00:00 
Max. value:  2021-09-14 14:46:28+00:00
The column does not contain missing values.


### Column: doneChargingTime

In [8]:
dataTypes('doneChargingTime', Charging_set)
valueRange('doneChargingTime', Charging_set)
missingValues('doneChargingTime', Charging_set)

Occurring data types:  [<class 'pandas._libs.tslibs.timestamps.Timestamp'>
 <class 'pandas._libs.tslibs.nattype.NaTType'>]
Min. value:  2018-04-25 13:21:10+00:00 
Max. value:  2021-09-14 14:46:22+00:00
The column contains 4088 missing values. This corresponds to 6.15 %.


WHERE DO THE MISSING VALUES COME FROM?

-> NOT BECAUSE NO CHARGING TOOK PLACE
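
The note above is consistent with the `kWhDelivered` column, whose minimum is 0.501, so every session delivered some energy. To look further into the gaps, sessions with and without `doneChargingTime` can be compared. The sketch below uses invented toy data. Falling back to `disconnectTime` is only one possible choice and an assumption here, not a decision taken in this notebook. The column name `doneChargingFilled` is hypothetical.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    'connectionTime': pd.to_datetime(
        ['2020-01-02 13:00', '2020-01-02 14:00', '2020-01-02 15:00'], utc=True),
    'disconnectTime': pd.to_datetime(
        ['2020-01-02 18:00', '2020-01-02 16:00', '2020-01-02 20:00'], utc=True),
    'doneChargingTime': pd.to_datetime(
        ['2020-01-02 17:00', None, '2020-01-02 19:00'], utc=True),
    'kWhDelivered': [25.0, 6.5, 13.4],
})

# Flag the sessions without a recorded end of charging.
missing_done = toy['doneChargingTime'].isnull().rename('doneChargingMissing')

# Compare the delivered energy of both groups.
mean_by_missing = toy.groupby(missing_done)['kWhDelivered'].mean()
print(mean_by_missing)

# ASSUMPTION: treat the disconnect time as the end of charging when it is unknown.
toy['doneChargingFilled'] = toy['doneChargingTime'].fillna(toy['disconnectTime'])
```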

### New column: totalConnectionTime

In [9]:
Charging_set['totalConnectionTime'] = Charging_set['disconnectTime'] - Charging_set['connectionTime']
dataTypes('totalConnectionTime', Charging_set)
valueRange('totalConnectionTime', Charging_set)
missingValues('totalConnectionTime', Charging_set)

Occurring data types:  [<class 'pandas._libs.tslibs.timedeltas.Timedelta'>]
Min. value:  0 days 00:02:04 
Max. value:  10 days 05:16:09
The column does not contain missing values.


### New column: totalChargingTime

In [10]:
Charging_set['totalChargingTime'] = Charging_set['doneChargingTime'] - Charging_set['connectionTime']
dataTypes('totalChargingTime', Charging_set)
valueRange('totalChargingTime', Charging_set)
missingValues('totalChargingTime', Charging_set)

Occurring data types:  [<class 'pandas._libs.tslibs.timedeltas.Timedelta'>
 <class 'pandas._libs.tslibs.nattype.NaTType'>]
Min. value:  -1 days +23:18:38 
Max. value:  8 days 08:00:57
The column contains 4088 missing values. This corresponds to 6.15 %.


WHY IS doneChargingTime > disconnectTime?
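
Two kinds of implausible values show up in the outputs: the minimum `totalChargingTime` of -1 days +23:18:38 (about -41 minutes) means that some `doneChargingTime` values lie before `connectionTime`, and in the dataset preview some `doneChargingTime` values lie about one minute after `disconnectTime`. The sketch below counts both cases and clamps the value into the connection window. The first two rows are taken from the preview, the third is invented. Clamping is an assumption about how to treat these rows, and the column name `doneChargingClamped` is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    'connectionTime': pd.to_datetime(
        ['2019-07-31 18:08:04', '2019-07-31 18:40:41', '2019-07-31 19:00:00'], utc=True),
    'disconnectTime': pd.to_datetime(
        ['2019-07-31 23:29:18', '2019-08-01 00:59:42', '2019-07-31 22:00:00'], utc=True),
    'doneChargingTime': pd.to_datetime(
        ['2019-07-31 23:30:18', '2019-07-31 21:44:23', '2019-07-31 18:30:00'], utc=True),
})

# Comparisons with missing values (NaT) evaluate to False, so those rows are left alone.
after_disconnect = toy['doneChargingTime'] > toy['disconnectTime']
before_connect = toy['doneChargingTime'] < toy['connectionTime']
print('done after disconnect:', int(after_disconnect.sum()))
print('done before connect:', int(before_connect.sum()))

# ASSUMPTION: clamp the end of charging into [connectionTime, disconnectTime].
clamped = toy['doneChargingTime'].mask(after_disconnect, toy['disconnectTime'])
clamped = clamped.mask(before_connect, toy['connectionTime'])
toy['doneChargingClamped'] = clamped
```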

### Column: kWhDelivered

In [11]:
dataTypes('kWhDelivered', Charging_set)
valueRange('kWhDelivered', Charging_set)
missingValues('kWhDelivered', Charging_set)

Occurring data types:  [<class 'float'>]
Min. value:  0.501 
Max. value:  108.79724166666666
The column does not contain missing values.


### Column: sessionID

In [12]:
dataTypes('sessionID', Charging_set)
valueRange('sessionID', Charging_set)
missingValues('sessionID', Charging_set)

Occurring data types:  [<class 'str'>]
Min. value:  1_1_178_817_2018-09-13 15:24:32.185314 
Max. value:  2_39_95_444_2021-07-14 16:54:32.800222
The column does not contain missing values.


### Column: siteID

In [13]:
dataTypes('siteID', Charging_set)
valueRange('siteID', Charging_set)
missingValues('siteID', Charging_set)

Occurring data types:  [<class 'int'>]
Min. value:  1 
Max. value:  2
The column does not contain missing values.


### Column: spaceID

In [14]:
dataTypes('spaceID', Charging_set)
valueRange('spaceID', Charging_set)
missingValues('spaceID', Charging_set)

Occurring data types:  [<class 'str'>]
Min. value:  11900388 
Max. value:  CA-513
The column does not contain missing values.


ARE THE DIFFERENT ID FORMATS (CA..., AG..., NUMBERS) A PROBLEM?
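
All `spaceID` values are strings, so the mixed formats are not a data type problem. Whether they matter depends on how the formats are distributed, for example across sites. A possible check is to label each ID by its letter prefix and cross-tabulate it with `siteID`. In the toy data below, the IDs appear in this notebook, but their assignment to sites is invented, and the column name `spaceIDFormat` is hypothetical.

```python
import pandas as pd

# Toy data: the site assignment of 'CA-513' and '11900388' is invented.
toy = pd.DataFrame({
    'siteID': [1, 1, 2, 2],
    'spaceID': ['AG-3F30', 'AG-1F01', 'CA-513', '11900388'],
})

# Letters before the first hyphen; IDs without such a prefix yield a missing value.
prefix = toy['spaceID'].str.extract(r'^([A-Za-z]+)-', expand=False)
toy['spaceIDFormat'] = prefix.fillna('numeric')

# How the ID formats are distributed over the sites.
print(pd.crosstab(toy['siteID'], toy['spaceIDFormat']))
```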

### Column: stationID

In [15]:
dataTypes('stationID', Charging_set)
valueRange('stationID', Charging_set)
missingValues('stationID', Charging_set)

Occurring data types:  [<class 'str'>]
Min. value:  1-1-178-817 
Max. value:  2-39-95-444
The column does not contain missing values.


### Column: timezone

In [16]:
dataTypes('timezone', Charging_set)
valueRange('timezone', Charging_set)
missingValues('timezone', Charging_set)
print('Occurring values: ', Charging_set['timezone'].unique())

Occurring data types:  [<class 'str'>]
Min. value:  America/Los_Angeles 
Max. value:  America/Los_Angeles
The column does not contain missing values.
Occurring values:  ['America/Los_Angeles']


The column 'timezone' contains only a single value ('America/Los_Angeles'). It therefore carries no information that distinguishes rows and can be dropped.

In [17]:
Charging_set = Charging_set.drop(['timezone'], axis=1)
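
The timestamps in the outputs carry a `+00:00` offset, so they are stored in UTC, while the dropped column named the local zone `America/Los_Angeles`. If the local time of day is needed later, the UTC timestamps can be converted with `Series.dt.tz_convert`. This is a minimal sketch on toy data; the column names `connectionTimeLocal` and `connectionHourLocal` are hypothetical.

```python
import pandas as pd

# Toy data: one winter and one summer timestamp in UTC.
toy = pd.DataFrame({
    'connectionTime': pd.to_datetime(
        ['2020-01-02 13:08:54+00:00', '2020-07-01 13:08:54+00:00'], utc=True),
})

# Convert from UTC to local time; daylight saving time is handled by the time zone rules.
toy['connectionTimeLocal'] = toy['connectionTime'].dt.tz_convert('America/Los_Angeles')
toy['connectionHourLocal'] = toy['connectionTimeLocal'].dt.hour

print(toy)
```

The same UTC hour maps to different local hours in winter and summer, which is why a fixed offset would not be a safe replacement.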

### Column: userID

In [40]:
dataTypes('userID', Charging_set)
valueRange('userID', Charging_set)
missingValues('userID', Charging_set)

Occurring data types:  [<class 'float'>]
Min. value:  1.0 
Max. value:  19923.0
The column contains 17263 missing values. This corresponds to 25.98 %.


SHOULD THE NaN VALUES BE REPLACED?
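
One alternative to replacing the missing values is to keep them and make them explicit. `userID` is read as float only because of the NaN entries; the nullable `Int64` dtype stores integer IDs together with missing values. The sketch below uses toy data, and the flag column `hasUserID` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({'userID': [194.0, np.nan, 1117.0]})

# Keep the information whether a user ID was recorded for the session.
toy['hasUserID'] = toy['userID'].notnull()

# Nullable integer dtype: integer IDs, missing values stay missing (<NA>).
toy['userID'] = toy['userID'].astype('Int64')

print(toy)
print(toy.dtypes)
```

This avoids inventing a placeholder ID that later code could mistake for a real user.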

### Column: userInputs

In [19]:
dataTypes('userInputs', Charging_set)
missingValues('userInputs', Charging_set)
Charging_set.at[3, 'userInputs']

Occurring data types:  [<class 'str'> <class 'float'>]
The column contains 17263 missing values. This corresponds to 25.98 %.


"[{'WhPerMile': 400, 'kWhRequested': 8.0, 'milesRequested': 20, 'minutesAvailable': 65, 'modifiedAt': 'Thu, 02 Jan 2020 14:00:03 GMT', 'paymentRequired': True, 'requestedDeparture': 'Thu, 02 Jan 2020 15:04:58 GMT', 'userID': 1117}, {'WhPerMile': 400, 'kWhRequested': 8.0, 'milesRequested': 20, 'minutesAvailable': 65, 'modifiedAt': 'Thu, 02 Jan 2020 14:00:19 GMT', 'paymentRequired': True, 'requestedDeparture': 'Thu, 02 Jan 2020 15:04:58 GMT', 'userID': 1117}]"

PROBLEM: SOME SESSIONS CONTAIN MULTIPLE USER INPUTS
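
The `userInputs` values are strings holding a Python-style list of dictionaries, so they can be parsed with `ast.literal_eval`. One way to handle several entries is to keep only the most recently modified one. That choice is an assumption, not a decision made in this notebook, and the helper name `latest_user_input` is hypothetical. The sketch uses the example value shown above plus one missing value.

```python
import ast
from email.utils import parsedate_to_datetime

import pandas as pd

# The example value printed above, split over several lines for readability.
raw = (
    "[{'WhPerMile': 400, 'kWhRequested': 8.0, 'milesRequested': 20, "
    "'minutesAvailable': 65, 'modifiedAt': 'Thu, 02 Jan 2020 14:00:03 GMT', "
    "'paymentRequired': True, 'requestedDeparture': 'Thu, 02 Jan 2020 15:04:58 GMT', "
    "'userID': 1117}, "
    "{'WhPerMile': 400, 'kWhRequested': 8.0, 'milesRequested': 20, "
    "'minutesAvailable': 65, 'modifiedAt': 'Thu, 02 Jan 2020 14:00:19 GMT', "
    "'paymentRequired': True, 'requestedDeparture': 'Thu, 02 Jan 2020 15:04:58 GMT', "
    "'userID': 1117}]"
)

def latest_user_input(value):
    """Parse one userInputs value and return the most recently modified entry."""
    if not isinstance(value, str):
        return None  # missing values are stored as float NaN
    entries = ast.literal_eval(value)
    if not entries:
        return None
    # ASSUMPTION: the entry with the latest 'modifiedAt' is the relevant one.
    return max(entries, key=lambda entry: parsedate_to_datetime(entry['modifiedAt']))

toy = pd.DataFrame({'userInputs': [raw, float('nan')]})
latest = [latest_user_input(value) for value in toy['userInputs']]

# Extract one field as an example; sessions without user input stay missing.
toy['kWhRequested'] = [entry['kWhRequested'] if entry else float('nan') for entry in latest]
print(toy['kWhRequested'])
```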

### Dataset description

In [20]:
Charging_set.info()
Charging_set

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 66450 entries, 0 to 66449
Data columns (total 14 columns):
 #   Column               Non-Null Count  Dtype              
---  ------               --------------  -----              
 0   number               66450 non-null  int64              
 1   id                   66450 non-null  object             
 2   connectionTime       66450 non-null  datetime64[ns, UTC]
 3   disconnectTime       66450 non-null  datetime64[ns, UTC]
 4   doneChargingTime     62362 non-null  datetime64[ns, UTC]
 5   kWhDelivered         66450 non-null  float64            
 6   sessionID            66450 non-null  object             
 7   siteID               66450 non-null  int64              
 8   spaceID              66450 non-null  object             
 9   stationID            66450 non-null  object             
 10  userID               49187 non-null  float64            
 11  userInputs           49187 non-null  object             
 12  totalConnectionTim

Unnamed: 0,number,id,connectionTime,disconnectTime,doneChargingTime,kWhDelivered,sessionID,siteID,spaceID,stationID,userID,userInputs,totalConnectionTime,totalChargingTime
0,0,5e23b149f9af8b5fe4b973cf,2020-01-02 13:08:54+00:00,2020-01-02 19:11:15+00:00,2020-01-02 17:31:35+00:00,25.016,1_1_179_810_2020-01-02 13:08:53.870034,1,AG-3F30,1-1-179-810,194.0,"[{'WhPerMile': 250, 'kWhRequested': 25.0, 'mil...",0 days 06:02:21,0 days 04:22:41
1,1,5e23b149f9af8b5fe4b973d0,2020-01-02 13:36:50+00:00,2020-01-02 22:38:21+00:00,2020-01-02 20:18:05+00:00,33.097,1_1_193_825_2020-01-02 13:36:49.599853,1,AG-1F01,1-1-193-825,4275.0,"[{'WhPerMile': 280, 'kWhRequested': 70.0, 'mil...",0 days 09:01:31,0 days 06:41:15
2,2,5e23b149f9af8b5fe4b973d1,2020-01-02 13:56:35+00:00,2020-01-03 00:39:22+00:00,2020-01-02 16:35:06+00:00,6.521,1_1_193_829_2020-01-02 13:56:35.214993,1,AG-1F03,1-1-193-829,344.0,"[{'WhPerMile': 400, 'kWhRequested': 8.0, 'mile...",0 days 10:42:47,0 days 02:38:31
3,3,5e23b149f9af8b5fe4b973d2,2020-01-02 13:59:58+00:00,2020-01-02 16:38:39+00:00,2020-01-02 15:18:45+00:00,2.355,1_1_193_820_2020-01-02 13:59:58.309319,1,AG-1F04,1-1-193-820,1117.0,"[{'WhPerMile': 400, 'kWhRequested': 8.0, 'mile...",0 days 02:38:41,0 days 01:18:47
4,4,5e23b149f9af8b5fe4b973d3,2020-01-02 14:00:01+00:00,2020-01-02 22:08:40+00:00,2020-01-02 18:17:30+00:00,13.375,1_1_193_819_2020-01-02 14:00:00.779967,1,AG-1F06,1-1-193-819,334.0,"[{'WhPerMile': 400, 'kWhRequested': 16.0, 'mil...",0 days 08:08:39,0 days 04:17:29
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
66445,10083,5d574ad2f9af8b4c10c03652,2019-07-31 18:08:04+00:00,2019-07-31 23:29:18+00:00,2019-07-31 23:30:18+00:00,28.787,1_1_179_809_2019-07-31 18:08:04.432654,1,AG-3F27,1-1-179-809,393.0,"[{'WhPerMile': 240, 'kWhRequested': 31.2, 'mil...",0 days 05:21:14,0 days 05:22:14
66446,10084,5d574ad2f9af8b4c10c03653,2019-07-31 18:40:41+00:00,2019-08-01 00:59:42+00:00,2019-07-31 21:44:23+00:00,7.787,1_1_179_810_2019-07-31 18:40:40.900203,1,AG-3F30,1-1-179-810,220.0,"[{'WhPerMile': 333, 'kWhRequested': 6.66, 'mil...",0 days 06:19:01,0 days 03:03:42
66447,10085,5d574ad2f9af8b4c10c03654,2019-07-31 19:04:40+00:00,2019-07-31 22:44:22+00:00,2019-07-31 22:45:21+00:00,11.274,1_1_191_795_2019-07-31 19:04:40.098273,1,AG-4F51,1-1-191-795,1974.0,"[{'WhPerMile': 333, 'kWhRequested': 19.98, 'mi...",0 days 03:39:42,0 days 03:40:41
66448,10086,5d574ad2f9af8b4c10c03655,2019-07-31 19:19:47+00:00,2019-08-01 00:34:51+00:00,2019-07-31 21:25:30+00:00,11.589,1_1_191_778_2019-07-31 19:19:46.919358,1,AG-4F43,1-1-191-778,942.0,"[{'WhPerMile': 275, 'kWhRequested': 22.0, 'mil...",0 days 05:15:04,0 days 02:05:43


## Weather Burbank Airport Dataset

Read the data into a DataFrame and parse the date column as datetimes.

In [21]:
Weather_set = pd.read_csv('weather_burbank_airport.csv', parse_dates=['timestamp'])
print('Columns: ', Weather_set.columns)
print('Number of rows: ', Weather_set.shape[0])

Columns:  Index(['city', 'timestamp', 'temperature', 'cloud_cover',
       'cloud_cover_description', 'pressure', 'windspeed', 'precipitation',
       'felt_temperature'],
      dtype='object')
Number of rows:  29244


### Column: city

In [22]:
dataTypes('city', Weather_set)
valueRange('city', Weather_set)
missingValues('city', Weather_set)
print('Occurring values: ', Weather_set['city'].unique())

Occurring data types:  [<class 'str'>]
Min. value:  Burbank 
Max. value:  Burbank
The column does not contain missing values.
Occurring values:  ['Burbank']


The column 'city' contains only a single value ('Burbank'). It therefore carries no information that distinguishes rows and can be dropped.

In [23]:
Weather_set = Weather_set.drop(['city'], axis=1)
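
The same reasoning applies to any column with a single distinct value, so the check can be generalized. The following is a small sketch on toy data; the helper name `constant_columns` is hypothetical.

```python
import pandas as pd

def constant_columns(data_set: pd.DataFrame) -> list:
    """Return the names of columns that hold at most one distinct value."""
    # dropna=False counts a missing value as its own distinct value.
    return [col for col in data_set.columns if data_set[col].nunique(dropna=False) <= 1]

# Toy data, invented for illustration only.
toy = pd.DataFrame({'city': ['Burbank'] * 3, 'temperature': [9.0, 9.0, 8.0]})

to_drop = constant_columns(toy)
print('Constant columns:', to_drop)
toy = toy.drop(columns=to_drop)
```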

### Column: timestamp

In [24]:
dataTypes('timestamp', Weather_set)
valueRange('timestamp', Weather_set)
missingValues('timestamp', Weather_set)
if Weather_set['timestamp'].is_monotonic_increasing:
    print('The rows are ordered by increasing time.')

Occurring data types:  [<class 'pandas._libs.tslibs.timestamps.Timestamp'>]
Min. value:  2018-01-01 08:53:00 
Max. value:  2021-01-01 07:53:00
The column does not contain missing values.
The rows are ordered by increasing time.


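The Charging Sessions dataset covers the period from 2018-04-25 11:08:04+00:00 to 2021-09-14 14:46:28+00:00.

The Weather Burbank Airport dataset covers the period from 2018-01-01 08:53:00 to 2021-01-01 07:53:00.

Problem: the two time spans do not match. The weather data ends on 2021-01-01, so charging sessions after that date have no matching weather observations. In addition, the charging timestamps carry a UTC offset, while the weather timestamps are timezone-naive.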
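
One way to deal with the mismatch is to restrict the sessions to the period covered by the weather data and then attach the nearest weather observation with `pd.merge_asof`. The sketch below runs on invented toy data. It assumes that the weather timestamps are in UTC, which the source does not state and which would have to be verified before using this on the real data.

```python
import pandas as pd

# Toy data, invented for illustration only.
sessions = pd.DataFrame({
    'connectionTime': pd.to_datetime(
        ['2020-12-31 20:10:00', '2021-01-01 07:30:00', '2021-09-14 05:43:39'], utc=True),
    'kWhDelivered': [10.0, 5.0, 7.5],
})
weather = pd.DataFrame({
    'timestamp': pd.to_datetime(
        ['2020-12-31 19:53:00', '2020-12-31 20:53:00', '2021-01-01 07:53:00']),
    'temperature': [14.0, 13.0, 11.0],
})

# ASSUMPTION: the naive weather timestamps are UTC. Both keys must share one time zone.
weather['timestamp'] = weather['timestamp'].dt.tz_localize('UTC')

# Sessions that start within the period covered by the weather data.
covered = sessions['connectionTime'].between(
    weather['timestamp'].min(), weather['timestamp'].max())
print('Sessions outside the weather time span:', int((~covered).sum()))

# Attach the nearest observation, at most one hour away. Both frames must be sorted by key.
merged = pd.merge_asof(
    sessions[covered].sort_values('connectionTime'),
    weather.sort_values('timestamp'),
    left_on='connectionTime',
    right_on='timestamp',
    direction='nearest',
    tolerance=pd.Timedelta('1h'),
)
print(merged)
```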

### Column: temperature

In [25]:
dataTypes('temperature', Weather_set)
valueRange('temperature', Weather_set)
missingValues('temperature', Weather_set)

Occurring data types:  [<class 'float'>]
Min. value:  2.0 
Max. value:  46.0
The column contains 25 missing values. This corresponds to 0.09 %.


To avoid losing rows with missing values, we fill them in. Because the data is sorted by time, each value is likely close to the one before it, so we forward-fill the missing values.

In [26]:
Weather_set['temperature'] = Weather_set['temperature'].ffill()
missingValues('temperature', Weather_set)

The column does not contain missing values.
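
Two properties of forward filling are worth keeping in mind: a gap at the very start of the series has no previous value and stays missing, and long gaps are filled with an increasingly stale value. The sketch below shows both on invented toy data and compares a limited forward fill with time-based interpolation as a possible alternative. It is an illustration, not the method used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration: hourly readings with gaps.
toy = pd.DataFrame({
    'timestamp': pd.date_range('2018-01-01 08:53', periods=6, freq=pd.Timedelta(hours=1)),
    'temperature': [np.nan, 9.0, np.nan, np.nan, 8.0, np.nan],
})

# Plain forward fill: the leading gap cannot be filled.
filled = toy['temperature'].ffill()

# Forward fill that bridges at most one consecutive missing value.
limited = toy['temperature'].ffill(limit=1)

# Alternative: interpolate linearly in time (requires a DatetimeIndex).
interpolated = toy.set_index('timestamp')['temperature'].interpolate(method='time')

print(pd.DataFrame({
    'raw': toy['temperature'],
    'ffill': filled,
    'ffill_limit_1': limited,
    'time_interp': interpolated.to_numpy(),
}))
```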


### Column: cloud_cover

In [27]:
dataTypes('cloud_cover', Weather_set)
valueRange('cloud_cover', Weather_set)
missingValues('cloud_cover', Weather_set)

Occurring data types:  [<class 'float'>]
Min. value:  4.0 
Max. value:  47.0
The column contains 20 missing values. This corresponds to 0.07 %.


To avoid losing rows with missing values, we fill them in. Because the data is sorted by time, each value is likely close to the one before it, so we forward-fill the missing values.

In [28]:
Weather_set['cloud_cover'] = Weather_set['cloud_cover'].ffill()
missingValues('cloud_cover', Weather_set)

The column does not contain missing values.


### Column: cloud_cover_description

In [29]:
dataTypes('cloud_cover_description', Weather_set)
missingValues('cloud_cover_description', Weather_set)

Occurring data types:  [<class 'str'> <class 'float'>]
The column contains 20 missing values. This corresponds to 0.07 %.


To avoid losing rows with missing values, we fill them in. Because the data is sorted by time, each description is likely the same as the one before it, so we forward-fill the missing values.

In [30]:
Weather_set['cloud_cover_description'] = Weather_set['cloud_cover_description'].ffill()
missingValues('cloud_cover_description', Weather_set)

The column does not contain missing values.


In [42]:
Weather_set["cloud_cover_description"].value_counts()

cloud_cover_description
Fair                       17140
Cloudy                      4937
Partly Cloudy               2668
Mostly Cloudy               1831
Light Rain                   896
Haze                         579
Smoke                        329
Fog                          325
Rain                         247
Heavy Rain                   120
Fair / Windy                  74
T-Storm                       18
Thunder in the Vicinity       17
Partly Cloudy / Windy         14
Light Rain / Windy            10
Mostly Cloudy / Windy         10
Cloudy / Windy                 9
Heavy Rain / Windy             7
Blowing Dust                   5
Heavy T-Storm                  4
Rain / Windy                   2
Thunder                        1
Light Rain with Thunder        1
Name: count, dtype: int64

HOW SHOULD THE CATEGORICAL DATA BE STORED?
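
One possible answer, sketched on a few of the labels listed above: several descriptions differ only by a " / Windy" suffix, so the suffix can be split off into a boolean flag and the remaining label stored with the pandas `category` dtype. For models that need numeric input, `pd.get_dummies` produces one-hot columns. The column names `windy` and `sky_condition` are hypothetical, and this is only one of several reasonable encodings.

```python
import pandas as pd

# Toy data using labels from the value counts above.
toy = pd.DataFrame({
    'cloud_cover_description': ['Fair', 'Fair / Windy', 'Light Rain', 'Cloudy / Windy', 'Fair'],
})
description = toy['cloud_cover_description']

# Split the wind information into its own boolean column.
toy['windy'] = description.str.endswith(' / Windy')

# Store the remaining label as a categorical value.
toy['sky_condition'] = description.str.replace(' / Windy', '', regex=False).astype('category')

# One-hot encoding for models that require numeric features.
dummies = pd.get_dummies(toy['sky_condition'], prefix='sky')

print(toy)
print(dummies)
```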

### Column: pressure

In [31]:
dataTypes('pressure', Weather_set)
valueRange('pressure', Weather_set)
missingValues('pressure', Weather_set)

Occurring data types:  [<class 'float'>]
Min. value:  971.0 
Max. value:  999.65
The column contains 8 missing values. This corresponds to 0.03 %.


To avoid losing rows with missing values, we fill them in. Because the data is sorted by time, each value is likely close to the one before it, so we forward-fill the missing values.

In [32]:
Weather_set['pressure'] = Weather_set['pressure'].ffill()
missingValues('pressure', Weather_set)

The column does not contain missing values.


### Column: windspeed

In [33]:
dataTypes('windspeed', Weather_set)
valueRange('windspeed', Weather_set)
missingValues('windspeed', Weather_set)

Occurring data types:  [<class 'float'>]
Min. value:  0.0 
Max. value:  57.0
The column contains 86 missing values. This corresponds to 0.29 %.


To avoid losing rows with missing values, we fill them in. Because the data is sorted by time, each value is likely close to the one before it, so we forward-fill the missing values.

In [34]:
Weather_set['windspeed'] = Weather_set['windspeed'].ffill()
missingValues('windspeed', Weather_set)

The column does not contain missing values.


### Column: precipitation

In [35]:
dataTypes('precipitation', Weather_set)
valueRange('precipitation', Weather_set)
missingValues('precipitation', Weather_set)

Occurring data types:  [<class 'float'>]
Min. value:  0.0 
Max. value:  18.54
The column does not contain missing values.


### Column: felt_temperature

In [36]:
dataTypes('felt_temperature', Weather_set)
valueRange('felt_temperature', Weather_set)
missingValues('felt_temperature', Weather_set)

Occurring data types:  [<class 'float'>]
Min. value:  0.0 
Max. value:  42.0
The column contains 26 missing values. This corresponds to 0.09 %.


To avoid losing rows with missing values, we fill them in. Because the data is sorted by time, each value is likely close to the one before it, so we forward-fill the missing values.

In [37]:
Weather_set['felt_temperature'] = Weather_set['felt_temperature'].ffill()
missingValues('felt_temperature', Weather_set)

The column does not contain missing values.


### Dataset description

In [38]:
Weather_set.info()
Weather_set

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29244 entries, 0 to 29243
Data columns (total 8 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   timestamp                29244 non-null  datetime64[ns]
 1   temperature              29244 non-null  float64       
 2   cloud_cover              29244 non-null  float64       
 3   cloud_cover_description  29244 non-null  object        
 4   pressure                 29244 non-null  float64       
 5   windspeed                29244 non-null  float64       
 6   precipitation            29244 non-null  float64       
 7   felt_temperature         29244 non-null  float64       
dtypes: datetime64[ns](1), float64(6), object(1)
memory usage: 1.8+ MB


Unnamed: 0,timestamp,temperature,cloud_cover,cloud_cover_description,pressure,windspeed,precipitation,felt_temperature
0,2018-01-01 08:53:00,9.0,33.0,Fair,991.75,9.0,0.0,8.0
1,2018-01-01 09:53:00,9.0,33.0,Fair,992.08,0.0,0.0,9.0
2,2018-01-01 10:53:00,9.0,21.0,Haze,992.08,0.0,0.0,9.0
3,2018-01-01 11:53:00,9.0,29.0,Partly Cloudy,992.08,0.0,0.0,9.0
4,2018-01-01 12:53:00,8.0,33.0,Fair,992.08,0.0,0.0,8.0
...,...,...,...,...,...,...,...,...
29239,2021-01-01 03:53:00,13.0,33.0,Fair,986.81,0.0,0.0,13.0
29240,2021-01-01 04:53:00,12.0,33.0,Fair,986.81,11.0,0.0,12.0
29241,2021-01-01 05:53:00,12.0,33.0,Fair,987.47,9.0,0.0,12.0
29242,2021-01-01 06:53:00,11.0,33.0,Fair,987.14,13.0,0.0,11.0
