# Task 1: Data Collection and Preparation

In [1]:
import pandas as pd
import ast
import numpy as np

## Charging Sessions Dataset

Read in the data as a dataframe. Interpret the date columns as dates.

## Weather Burbank Airport Dataset

Read in the data as a dataframe. Interpret the date columns as dates.

In [17]:
Weather_set = pd.read_csv('weather_burbank_airport.csv', parse_dates=['timestamp'])
print('Columns: ', Weather_set.columns)
print('Number of rows: ', Weather_set.shape[0])

Columns:  Index(['city', 'timestamp', 'temperature', 'cloud_cover',
       'cloud_cover_description', 'pressure', 'windspeed', 'precipitation',
       'felt_temperature'],
      dtype='object')
Number of rows:  29244


We delete duplicate rows, i.e. rows whose values are identical in every column.

In [18]:
numRows = Weather_set.shape[0]
Weather_set = Weather_set.drop_duplicates()
print('Number of duplicate rows: ', numRows-Weather_set.shape[0])

Number of duplicate rows:  0


In [19]:
def missingValues(columnName, dataSet):
    # Report how many values of the column are missing, absolute and in percent.
    numMissing = dataSet[columnName].isnull().sum()
    if numMissing > 0:
        percentage = round(numMissing / dataSet.shape[0] * 100, 2)
        print('The column contains', numMissing, 'missing values. This corresponds to', percentage, '%.')
    else:
        print('The column does not contain missing values.')

def valueRange(columnName, dataSet):
    # Print the smallest and largest value of the column.
    print('Min. value: ', dataSet[columnName].min(), '\nMax. value: ', dataSet[columnName].max())

def dataTypes(columnName, dataSet):
    # Print the distinct Python types of the values stored in the column.
    print('Occuring data types: ', dataSet[columnName].apply(type).unique())

### Column: city

In [20]:
dataTypes('city', Weather_set)
valueRange('city', Weather_set)
missingValues('city', Weather_set)
print( 'Occuring values: ', Weather_set['city'].unique())

Occuring data types:  [<class 'str'>]
Min. value:  Burbank 
Max. value:  Burbank
The column does not contain missing values.
Occuring values:  ['Burbank']


In [21]:
dataTypes('timestamp', Weather_set)
valueRange('timestamp', Weather_set)
missingValues('timestamp', Weather_set)
if Weather_set['timestamp'].is_monotonic_increasing:
    print('The rows are ordered by increasing time.')
print('Occuring time intervals: ', Weather_set['timestamp'].diff().unique())

Occuring data types:  [<class 'pandas._libs.tslibs.timestamps.Timestamp'>]
Min. value:  2018-01-01 08:53:00 
Max. value:  2021-01-01 07:53:00
The column does not contain missing values.
The rows are ordered by increasing time.
Occuring time intervals:  <TimedeltaArray>
[              NaT, '0 days 01:00:00', '0 days 00:43:00', '0 days 00:05:00',
 '0 days 00:12:00', '0 days 00:37:00', '0 days 00:23:00', '0 days 00:31:00',
 '0 days 00:17:00', '0 days 00:04:00', '0 days 00:11:00', '0 days 00:28:00',
 '0 days 00:41:00', '0 days 00:19:00', '0 days 00:39:00', '0 days 00:21:00',
 '0 days 00:38:00', '0 days 00:20:00', '0 days 00:02:00', '0 days 00:08:00',
 '0 days 00:15:00', '0 days 00:24:00', '0 days 00:14:00', '0 days 00:46:00',
 '0 days 00:36:00', '0 days 00:10:00', '0 days 00:09:00', '0 days 00:51:00',
 '0 days 00:48:00', '0 days 00:03:00', '0 days 00:30:00', '0 days 00:16:00',
 '0 days 00:42:00', '0 days 00:49:00', '0 days 00:29:00', '0 days 00:32:00',
 '0 days 00:44:00', '0 days 00:07:00'

The column 'city' only contains identical values. Therefore, it can be deleted.

In [22]:
Weather_set = Weather_set.drop(['city'], axis=1)

### Column: timestamp

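The Charging Sessions Dataset contains data between 2018-04-25 11:08:04+00:00 and 2021-09-14 14:46:28+00:00.

The Weather Burbank Airport Dataset contains data between 2018-01-01 08:53:00 and 2021-01-01 07:53:00.

Problem: the two datasets do not cover the same time span. The weather data starts earlier but ends on 2021-01-01, so charging sessions after that date have no matching weather observations. Note also that the charging timestamps carry a UTC offset, while the weather timestamps are timezone-naive.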
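To quantify the mismatch, the overlap of the two time spans can be computed directly from the boundaries quoted above. This is only a minimal sketch: it assumes that the timezone-naive weather timestamps are in UTC, which the dataset itself does not state.

```python
import pandas as pd

# Time spans quoted in the text above.
charging_start = pd.Timestamp("2018-04-25 11:08:04+00:00")
charging_end = pd.Timestamp("2021-09-14 14:46:28+00:00")

# ASSUMPTION: the timezone-naive weather timestamps are in UTC.
weather_start = pd.Timestamp("2018-01-01 08:53:00").tz_localize("UTC")
weather_end = pd.Timestamp("2021-01-01 07:53:00").tz_localize("UTC")

# Period that is covered by both datasets.
overlap_start = max(charging_start, weather_start)
overlap_end = min(charging_end, weather_end)

# Part of the charging data that lies after the last weather observation.
uncovered = charging_end - overlap_end

print("Overlap:", overlap_start, "to", overlap_end)
print("Charging data without weather coverage:", uncovered)
```

Under this assumption, roughly the last eight and a half months of charging sessions cannot be joined with weather data.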

### Column: temperature

In [23]:
dataTypes('temperature', Weather_set)
valueRange('temperature', Weather_set)
missingValues('temperature', Weather_set)

Occuring data types:  [<class 'float'>]
Min. value:  2.0 
Max. value:  46.0
The column contains 25 missing values. This corresponds to 0.09 %.


To avoid losing rows with missing values, we fill the gaps instead of dropping the rows. Because the data is sorted by time, the true value is likely close to the neighbouring observations, so we forward-fill each missing value with the last valid one.

In [24]:
Weather_set['temperature'] = Weather_set['temperature'].ffill()
missingValues('temperature', Weather_set)

The column does not contain missing values.
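The behaviour of the forward fill is easiest to see on a small example. The sketch below uses a made-up series (the values are not taken from the dataset) and also shows one edge case: a gap at the very start has no earlier observation to copy, so `ffill()` leaves it missing. Adding `bfill()` afterwards is one possible way to close such a gap, not something the notebook does.

```python
import numpy as np
import pandas as pd

# Toy temperature series with gaps (values are made up for illustration).
toy = pd.Series([np.nan, 20.0, np.nan, np.nan, 23.0])

# Forward fill: every missing value takes the last valid value before it.
filled = toy.ffill()
print(filled)

# The leading gap has no predecessor and stays missing after ffill();
# a subsequent backward fill copies the next valid value into it.
filled_both = toy.ffill().bfill()
print(filled_both)
```

In the weather data the first row of each filled column is evidently not missing, since no missing values remain after the forward fill.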


### Column: cloud_cover

In [25]:
dataTypes('cloud_cover', Weather_set)
valueRange('cloud_cover', Weather_set)
missingValues('cloud_cover', Weather_set)

Occuring data types:  [<class 'float'>]
Min. value:  4.0 
Max. value:  47.0
The column contains 20 missing values. This corresponds to 0.07 %.


To avoid losing rows with missing values, we fill the gaps instead of dropping the rows. Because the data is sorted by time, the true value is likely close to the neighbouring observations, so we forward-fill each missing value with the last valid one.

In [26]:
Weather_set['cloud_cover'] = Weather_set['cloud_cover'].ffill()
missingValues('cloud_cover', Weather_set)

The column does not contain missing values.


### Column: cloud_cover_description

In [27]:
dataTypes('cloud_cover_description', Weather_set)
missingValues('cloud_cover_description', Weather_set)

Occuring data types:  [<class 'str'> <class 'float'>]
The column contains 20 missing values. This corresponds to 0.07 %.


To avoid losing rows with missing values, we fill the gaps instead of dropping the rows. Because the data is sorted by time, the true description likely matches that of the preceding observation, so we forward-fill each missing value with the last valid one.

In [28]:
Weather_set['cloud_cover_description'] = Weather_set['cloud_cover_description'].ffill()
missingValues('cloud_cover_description', Weather_set)

The column does not contain missing values.


In [29]:
Weather_set["cloud_cover_description"].value_counts()

cloud_cover_description
Fair                       17140
Cloudy                      4937
Partly Cloudy               2668
Mostly Cloudy               1831
Light Rain                   896
Haze                         579
Smoke                        329
Fog                          325
Rain                         247
Heavy Rain                   120
Fair / Windy                  74
T-Storm                       18
Thunder in the Vicinity       17
Partly Cloudy / Windy         14
Light Rain / Windy            10
Mostly Cloudy / Windy         10
Cloudy / Windy                 9
Heavy Rain / Windy             7
Blowing Dust                   5
Heavy T-Storm                  4
Rain / Windy                   2
Thunder                        1
Light Rain with Thunder        1
Name: count, dtype: int64

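Many classes are rare, and several differ from a more frequent class only by the suffix " / Windy". It could therefore be useful to group the classes into a few broader categories.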
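One way to group the classes is sketched below. The mapping and the group names (`Clear`, `Cloudy`, `Rain`, `Storm`, `Reduced visibility`, `Other`) are hypothetical choices made for illustration, not a grouping defined by the dataset. The helper `group_description` is likewise an assumed name. It drops the " / Windy" suffix and then looks up the remaining base class.

```python
import pandas as pd

# HYPOTHETICAL grouping of the base classes into broader categories.
GROUPS = {
    "Fair": "Clear",
    "Cloudy": "Cloudy",
    "Partly Cloudy": "Cloudy",
    "Mostly Cloudy": "Cloudy",
    "Light Rain": "Rain",
    "Rain": "Rain",
    "Heavy Rain": "Rain",
    "T-Storm": "Storm",
    "Heavy T-Storm": "Storm",
    "Thunder": "Storm",
    "Thunder in the Vicinity": "Storm",
    "Light Rain with Thunder": "Storm",
    "Haze": "Reduced visibility",
    "Smoke": "Reduced visibility",
    "Fog": "Reduced visibility",
    "Blowing Dust": "Reduced visibility",
}

def group_description(description: str) -> str:
    """Map a cloud_cover_description value to a broader (hypothetical) group."""
    # "Fair / Windy" -> "Fair": the wind information is dropped here.
    base = description.split(" / ")[0]
    return GROUPS.get(base, "Other")

# Small sample of values that occur in the column.
sample = pd.Series(["Fair", "Fair / Windy", "Mostly Cloudy / Windy",
                    "Heavy Rain", "Smoke", "Thunder in the Vicinity"])
grouped = sample.map(group_description)
print(grouped.value_counts())
```

Applied to the full column, this would be `Weather_set['cloud_cover_description'].map(group_description)`. Whether to keep the wind information as a separate flag depends on the later analysis, since `windspeed` is already available as its own column.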

### Column: pressure

In [30]:
dataTypes('pressure', Weather_set)
valueRange('pressure', Weather_set)
missingValues('pressure', Weather_set)

Occuring data types:  [<class 'float'>]
Min. value:  971.0 
Max. value:  999.65
The column contains 8 missing values. This corresponds to 0.03 %.


To avoid losing rows with missing values, we fill the gaps instead of dropping the rows. Because the data is sorted by time, the true value is likely close to the neighbouring observations, so we forward-fill each missing value with the last valid one.

In [31]:
Weather_set['pressure'] = Weather_set['pressure'].ffill()
missingValues('pressure', Weather_set)

The column does not contain missing values.


### Column: windspeed

In [32]:
dataTypes('windspeed', Weather_set)
valueRange('windspeed', Weather_set)
missingValues('windspeed', Weather_set)

Occuring data types:  [<class 'float'>]
Min. value:  0.0 
Max. value:  57.0
The column contains 86 missing values. This corresponds to 0.29 %.


To avoid losing rows with missing values, we fill the gaps instead of dropping the rows. Because the data is sorted by time, the true value is likely close to the neighbouring observations, so we forward-fill each missing value with the last valid one.

In [33]:
Weather_set['windspeed'] = Weather_set['windspeed'].ffill()
missingValues('windspeed', Weather_set)

The column does not contain missing values.


### Column: precipitation

In [34]:
dataTypes('precipitation', Weather_set)
valueRange('precipitation', Weather_set)
missingValues('precipitation', Weather_set)

Occuring data types:  [<class 'float'>]
Min. value:  0.0 
Max. value:  18.54
The column does not contain missing values.


### Column: felt_temperature

In [35]:
dataTypes('felt_temperature', Weather_set)
valueRange('felt_temperature', Weather_set)
missingValues('felt_temperature', Weather_set)

Occuring data types:  [<class 'float'>]
Min. value:  0.0 
Max. value:  42.0
The column contains 26 missing values. This corresponds to 0.09 %.


To avoid losing rows with missing values, we fill the gaps instead of dropping the rows. Because the data is sorted by time, the true value is likely close to the neighbouring observations, so we forward-fill each missing value with the last valid one.

In [36]:
Weather_set['felt_temperature'] = Weather_set['felt_temperature'].ffill()
missingValues('felt_temperature', Weather_set)

The column does not contain missing values.
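Since the same forward fill is applied to each affected column, the per-column cells could also be condensed into a single step. The following is a sketch on a made-up toy frame, not a replacement for the checks above. The column names are taken from the weather data, while the values and the variable `columns_to_fill` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with a few of the weather columns (values are made up).
toy = pd.DataFrame({
    "temperature": [15.0, np.nan, 17.0],
    "windspeed": [5.0, 6.0, np.nan],
    "cloud_cover_description": ["Fair", None, "Cloudy"],
})

columns_to_fill = ["temperature", "windspeed", "cloud_cover_description"]

# Count the gaps, forward-fill all listed columns at once, then count again.
missing_before = toy[columns_to_fill].isnull().sum()
toy[columns_to_fill] = toy[columns_to_fill].ffill()
missing_after = toy[columns_to_fill].isnull().sum()

print(pd.DataFrame({"before": missing_before, "after": missing_after}))
```

Handling the columns one by one, as done in this notebook, has the advantage that the data types, value range and number of missing values are inspected for each column before anything is changed.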


### Dataset description

In [37]:
Weather_set.info()
Weather_set.to_csv('weather_burbank_airport_preprocessed.csv', index=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29244 entries, 0 to 29243
Data columns (total 8 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   timestamp                29244 non-null  datetime64[ns]
 1   temperature              29244 non-null  float64       
 2   cloud_cover              29244 non-null  float64       
 3   cloud_cover_description  29244 non-null  object        
 4   pressure                 29244 non-null  float64       
 5   windspeed                29244 non-null  float64       
 6   precipitation            29244 non-null  float64       
 7   felt_temperature         29244 non-null  float64       
dtypes: datetime64[ns](1), float64(6), object(1)
memory usage: 1.8+ MB
