# Data Extraction and Processing

This notebook walks through the data extraction and processing pipeline for recorded solar power generation, installed capacity, and weather. Generation forecasts are also pulled for benchmarking purposes.

## Setup

In [1]:
# Import packages
import yaml
import pathlib
import pandas as pd

# Reload modules in case changes are made to the codebase
%reload_ext autoreload
%autoreload 2

# Import modules
from lib import generation_processor as gp
from lib import capacity_processor as cp
from lib import weather_puller as wp
from lib import weather_parser as wpr


### Set display preferences
# Tabular
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)

# Warnings
import warnings; warnings.filterwarnings(action='ignore')

## Solar Power Generation

### Extraction

The [ENTSO-E Transparency Platform](https://transparency.entsoe.eu/) publishes actual and forecast power generation, by source, for most European grid operators. The platform's API is a little tedious to work with, but its data views provide what we need: [actual generation per production type](https://transparency.entsoe.eu/generation/r2/actualGenerationPerProductionType/show) and [day-ahead generation forecasts for wind and solar](https://transparency.entsoe.eu/generation/r2/dayAheadGenerationForecastWindAndSolar/show).

I pulled actuals and forecasts for the countries in scope (Germany, France, Spain, and Italy) for 2016, 2017, and 2018. The manual process was fairly painless, but if you'd rather automate it you could use the [API](https://transparency.entsoe.eu/content/static_content/Static%20content/web%20api/Guide.html) or write a quick Selenium script. The German grid is managed by four different operators (TenneT, 50Hertz, Amprion, and TransnetBW), while the French, Spanish, and Italian grids are each managed by a single operator.

**German Transmission System Operators**
<img src="images/Regelzonen_deutscher_Übertragungsnetzbetreiber_neu.png" height= "250" width="400">

### Processing

The generation_processor module, and the two parser classes within it, handle the files output by the transparency platform. They read the files, clean dates and names, filter for columns of interest, and add convenience columns for exploration and modeling. The cleaned data is written to the processed_data folder. You can take a look at the code and the available arguments by typing `??gp.ActualGenerationParser` or `??gp.ForecastGenerationParser`.

Usage is shown below. 

#### Actuals

In [2]:
agp = gp.ActualGenerationParser()
agp.parse()
print(agp.actuals.shape)
agp.actuals.head(5)

Written
(473079, 11)


int_start,operator,solar,int_start,year,month,week,hour,minute,base_hour,month_year,week_year
2018-01-01 00:00:00,FR,0,2018-01-01 00:00:00,2018,1,1,0,0,2018-01-01 00:00:00,2018-01,2018-01-01/2018-01-07
2018-01-01 01:00:00,FR,0,2018-01-01 01:00:00,2018,1,1,1,0,2018-01-01 01:00:00,2018-01,2018-01-01/2018-01-07
2018-01-01 02:00:00,FR,0,2018-01-01 02:00:00,2018,1,1,2,0,2018-01-01 02:00:00,2018-01,2018-01-01/2018-01-07
2018-01-01 03:00:00,FR,0,2018-01-01 03:00:00,2018,1,1,3,0,2018-01-01 03:00:00,2018-01,2018-01-01/2018-01-07
2018-01-01 04:00:00,FR,0,2018-01-01 04:00:00,2018,1,1,4,0,2018-01-01 04:00:00,2018-01,2018-01-01/2018-01-07
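The convenience columns in the output above can be derived from the interval start timestamp alone. The following is a minimal sketch of how that might be done with pandas; `add_time_columns` is a hypothetical helper, not the code in `generation_processor`. It assumes `week` is the ISO week number and that `week_year` and `month_year` are pandas period strings, which is consistent with the values displayed in this notebook.

```python
import pandas as pd


def add_time_columns(df, ts_col="int_start"):
    """Add calendar columns derived from the interval start timestamp.

    Hypothetical helper: column names mirror the parser output shown in
    this notebook, but the implementation is an assumption.
    """
    out = df.copy()
    ts = pd.to_datetime(out[ts_col])
    out["year"] = ts.dt.year
    out["month"] = ts.dt.month
    out["week"] = ts.dt.isocalendar().week.astype(int)  # ISO week number
    out["hour"] = ts.dt.hour
    out["minute"] = ts.dt.minute
    out["base_hour"] = ts.dt.floor("h")  # start of the containing hour
    out["month_year"] = ts.dt.to_period("M").astype(str)
    out["week_year"] = ts.dt.to_period("W").astype(str)  # Monday-to-Sunday span
    return out


sample = pd.DataFrame({"int_start": ["2016-01-01 00:15:00"]})
print(add_time_columns(sample).T)
```

The `base_hour` column is what makes it possible to roll 15-minute intervals up to the hour they belong to.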


The **'solar'** column gives the **megawatt-hours (MWh)** of solar power generated within an operator's jurisdiction during a given time interval.

Interestingly, not all operators report data at the same interval: the German operators report numbers for each 15-minute interval, while the remaining operators provide only hourly figures.

In [3]:
sorted(agp.int_lengths.items(), key=lambda x: x[1])

[('DE(TenneT GER)', 15.0),
 ('DE(TransnetBW)', 15.0),
 ('DE(50Hertz)', 15.0),
 ('DE(Amprion)', 15.0),
 ('FR', 60.0),
 ('ES', 60.0),
 ('IT', 60.0)]
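One way to arrive at interval lengths like these is to take the most common gap between consecutive timestamps for each operator. The sketch below illustrates the idea; `interval_lengths` is a hypothetical function and may differ from how the parser populates `int_lengths`.

```python
import pandas as pd


def interval_lengths(df, operator_col="operator", ts_col="int_start"):
    """Return the most common reporting interval, in minutes, per operator.

    Hypothetical helper; assumes each operator has at least two rows.
    """
    lengths = {}
    for operator, group in df.groupby(operator_col):
        ts = pd.to_datetime(group[ts_col]).sort_values()
        gaps = ts.diff().dropna().dt.total_seconds() / 60
        # The mode is robust to occasional missing intervals
        lengths[operator] = float(gaps.mode().iloc[0])
    return lengths


sample = pd.DataFrame({
    "operator": ["DE(Amprion)"] * 3 + ["FR"] * 3,
    "int_start": [
        "2018-01-01 00:00", "2018-01-01 00:15", "2018-01-01 00:30",
        "2018-01-01 00:00", "2018-01-01 01:00", "2018-01-01 02:00",
    ],
})
print(interval_lengths(sample))
```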

To enable an apples-to-apples comparison across operators, the parser classes accept an **hourly flag** (`hourly=True`) that aggregates the more granular data into hourly values.

In [4]:
agp = gp.ActualGenerationParser(hourly=True)
agp.parse()
print(agp.actuals.shape)
agp.actuals.head()

Written
(174271, 11)


int_start,operator,int_start,solar,year,month,week,hour,minute,base_hour,month_year,week_year
2016-01-01 00:00:00,DE(50Hertz),2016-01-01 00:00:00,0.0,2016,1,53,0,0,2016-01-01 00:00:00,2016-01,2015-12-28/2016-01-03
2016-01-01 01:00:00,DE(50Hertz),2016-01-01 01:00:00,0.0,2016,1,53,1,0,2016-01-01 01:00:00,2016-01,2015-12-28/2016-01-03
2016-01-01 02:00:00,DE(50Hertz),2016-01-01 02:00:00,0.0,2016,1,53,2,0,2016-01-01 02:00:00,2016-01,2015-12-28/2016-01-03
2016-01-01 03:00:00,DE(50Hertz),2016-01-01 03:00:00,0.0,2016,1,53,3,0,2016-01-01 03:00:00,2016-01,2015-12-28/2016-01-03
2016-01-01 04:00:00,DE(50Hertz),2016-01-01 04:00:00,0.0,2016,1,53,4,0,2016-01-01 04:00:00,2016-01,2015-12-28/2016-01-03
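The hourly roll-up can be pictured as flooring each timestamp to its hour and aggregating per operator. This is a sketch under assumptions rather than the parser's actual logic: `to_hourly` is a hypothetical function, and the right aggregation depends on the unit. Summing suits energy per interval (MWh), whereas averaging would be appropriate if the values were average power (MW).

```python
import pandas as pd


def to_hourly(df, value_col="solar", how="sum"):
    """Aggregate sub-hourly rows to one row per operator and hour.

    Hypothetical helper; `how` is passed to pandas' groupby aggregation.
    """
    out = df.copy()
    out["int_start"] = pd.to_datetime(out["int_start"]).dt.floor("h")
    return out.groupby(["operator", "int_start"], as_index=False)[value_col].agg(how)


quarter_hours = pd.DataFrame({
    "operator": ["DE(50Hertz)"] * 4,
    "int_start": pd.date_range("2018-06-01 12:00", periods=4, freq="15min"),
    "solar": [1.0, 2.0, 3.0, 4.0],
})
print(to_hourly(quarter_hours, how="sum"))
print(to_hourly(quarter_hours, how="mean"))
```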


#### Forecasts

While raw forecast data is structured differently, its processing follows a similar pattern.

In [5]:
fgp = gp.ForecastGenerationParser()
fgp.parse()
print(fgp.forecasts.shape)
fgp.forecasts.head()

Written
(315420, 13)


int_start,operator,solar_da,solar_id,solar_c,int_start,year,month,week,hour,minute,base_hour,month_year,week_year
2018-01-01 00:00:00,DE(TenneT GER),0,0,,2018-01-01 00:00:00,2018,1,1,0,0,2018-01-01 00:00:00,2018-01,2018-01-01/2018-01-07
2018-01-01 00:15:00,DE(TenneT GER),0,0,,2018-01-01 00:15:00,2018,1,1,0,15,2018-01-01 00:00:00,2018-01,2018-01-01/2018-01-07
2018-01-01 00:30:00,DE(TenneT GER),0,0,,2018-01-01 00:30:00,2018,1,1,0,30,2018-01-01 00:00:00,2018-01,2018-01-01/2018-01-07
2018-01-01 00:45:00,DE(TenneT GER),0,0,,2018-01-01 00:45:00,2018,1,1,0,45,2018-01-01 00:00:00,2018-01,2018-01-01/2018-01-07
2018-01-01 01:00:00,DE(TenneT GER),0,0,,2018-01-01 01:00:00,2018,1,1,1,0,2018-01-01 01:00:00,2018-01,2018-01-01/2018-01-07


Forecasts are provided at the same intervals as actuals.

In [6]:
sorted(fgp.int_lengths.items(), key=lambda x: x[1])

[('DE(TenneT GER)', 15.0),
 ('DE(TransnetBW)', 15.0),
 ('DE(Amprion)', 15.0),
 ('DE(50Hertz)', 15.0),
 ('ES', 60.0),
 ('IT', 60.0),
 ('FR', 60.0)]

In [7]:
fgp = gp.ForecastGenerationParser(hourly=True)
fgp.parse()
print(fgp.forecasts.shape)
fgp.forecasts.head()

Written
(96837, 11)


int_start,operator,int_start,solar_da,year,month,week,hour,minute,base_hour,month_year,week_year
2017-01-01 00:00:00,DE(50Hertz),2017-01-01 00:00:00,0.0,2017,1,52,0,0,2017-01-01 00:00:00,2017-01,2016-12-26/2017-01-01
2017-01-01 01:00:00,DE(50Hertz),2017-01-01 01:00:00,0.0,2017,1,52,1,0,2017-01-01 01:00:00,2017-01,2016-12-26/2017-01-01
2017-01-01 02:00:00,DE(50Hertz),2017-01-01 02:00:00,0.0,2017,1,52,2,0,2017-01-01 02:00:00,2017-01,2016-12-26/2017-01-01
2017-01-01 03:00:00,DE(50Hertz),2017-01-01 03:00:00,0.0,2017,1,52,3,0,2017-01-01 03:00:00,2017-01,2016-12-26/2017-01-01
2017-01-01 04:00:00,DE(50Hertz),2017-01-01 04:00:00,0.0,2017,1,52,4,0,2017-01-01 04:00:00,2017-01,2016-12-26/2017-01-01


## Installed Capacity

The ENTSO-E platform also [publishes](https://transparency.entsoe.eu/generation/r2/installedGenerationCapacityAggregation/show?name=&defaultValue=false&viewType=TABLE&areaType=CTA&atch=false&dateTime.dateTime=01.01.2018+00:00|UTC|YEAR&dateTime.endDateTime=01.01.2018+00:00|UTC|YEAR&area.values=CTY|10Y1001A1001A83F!CTA|10YDE-VE-------2&productionType.values=B01&productionType.values=B02&productionType.values=B03&productionType.values=B04&productionType.values=B05&productionType.values=B06&productionType.values=B07&productionType.values=B08&productionType.values=B09&productionType.values=B10&productionType.values=B11&productionType.values=B12&productionType.values=B13&productionType.values=B14&productionType.values=B20&productionType.values=B15&productionType.values=B16&productionType.values=B17&productionType.values=B18&productionType.values=B19) installed capacity in **MW** by source for each operator. Unfortunately, this information is only available on an annual basis. Any model developed using this data will have to be retrained more frequently than if more granular data were available.

The CapacityParser class cleans and parses files output by the platform.

In [8]:
cpr = cp.CapacityParser()
cpr.parse()
cpr.cap_data.head(10)

Written


Unnamed: 0,operator,year,solcap
7,FR,2016,397
8,DE(TenneT GER),2016,15156
9,DE(50Hertz),2016,8901
10,DE(TransnetBW),2016,5449
11,ES,2016,6500
12,DE(Amprion),2016,9334
13,IT,2016,4768
14,FR,2017,378
15,DE(TenneT GER),2017,15524
16,DE(50Hertz),2017,9749


## Weather

The coordinates selected should approximate the geographical spread of solar capacity within the region of interest. Operators should be able to do this quite accurately, but since we're not privy to the same information they are, we'll pick locations that are well-distributed and have higher rates of solar insolation.

**Selected Coordinates**

In [24]:
with open('coordinates.yml', 'r') as file:
    coords = yaml.safe_load(file)  # safe_load avoids the Loader argument that yaml.load requires in recent PyYAML
pd.DataFrame(coords)

Unnamed: 0,DE(TenneT GER),DE(Amprion),DE(50Hertz),DE(TransnetBW),ES,FR,IT
0,"54.4945,9.418","52.1981,7.1708","53.7689,13.3863","48.7636,9.1709","41.8038,-5.6495","48.78,5.98","42.36,11.5219"
1,"53.2286,8.2073","51.2419,6.2436","52.6166,12.3239","47.8104,7.8306","37.8525,-6.2658","46.865,2.7855","41.1309,14.0218"
2,"49.6208,10.5941","50.7588,8.4083","52.391,14.1889","47.83,9.4192","40.1282,-3.4962","44.7255,-0.8169","40.8399,17.1639"
3,"48.3685,10.9753","49.5538,8.1645","51.4889,12.2486","49.1983,8.5981","38.393,-1.211","44.5016,2.7085","39.124,16.3936"
4,"48.6645,13.0427","47.9251,10.3095","51.5328,14.5256","48.7317,9.8487","41.3765,-0.3761",446,"37.7109,13.4931"


<img src="images/coordinates_sat.png" height= "250" width="700">

Five locations were selected for each operator. Because Germany has four operators, it has four times as many coordinates as each of the other countries. A production system should have points allocated in proportion to the area being covered, but I've kept it uniform here to demonstrate the impact of coordinate density on accuracy.
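The table above suggests that `coordinates.yml` maps each operator to a list of `"lat,lon"` strings. Under that assumption, the entries can be converted to numeric pairs as sketched below; `parse_coords` is a hypothetical helper, and the check on the number of parts is there to surface malformed entries early.

```python
def parse_coords(coords):
    """Convert {'operator': ['lat,lon', ...]} into lists of (lat, lon) floats.

    Hypothetical helper; the input structure is inferred from the table
    displayed in this notebook.
    """
    parsed = {}
    for operator, points in coords.items():
        pairs = []
        for point in points:
            parts = str(point).split(",")
            if len(parts) != 2:
                raise ValueError(f"{operator}: expected 'lat,lon', got {point!r}")
            lat, lon = (float(p) for p in parts)
            pairs.append((lat, lon))
        parsed[operator] = pairs
    return parsed


example = {"IT": ["42.36,11.5219", "41.1309,14.0218"]}
print(parse_coords(example))
```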

### Pull

The [Dark Sky API](https://darksky.net/dev) provides forecasts as well as historical weather data. Unfortunately, it does not store forecasts made in the past, so a historical API call returns the observed hour-by-hour weather and daily weather conditions for a particular date rather than what was forecast at the time.

The WeatherPuller class pulls weather data (historical observations or forecasts) from the Dark Sky API. I pulled historical data for 2016, 2017, and 2018 for all 5 coordinates of each of the 7 operators. A call for a single date gets you data for all 24 hours. You can take a look at the available arguments and default values by typing `?wp.WeatherPuller`.

Usage is shown below. 

In [16]:
wpl = wp.WeatherPuller('hist', '2018-11-14 05:00:00', '2018-11-15 05:00:00')
wpl.pull()

api_key: ········
71%
Written
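Since one historical call covers a single date at a single location, a pull over a date range amounts to one request per coordinate per day. The sketch below only builds the request URLs, following the Time Machine request format described in the Dark Sky documentation (`/forecast/[key]/[latitude],[longitude],[time]`); `time_machine_urls` is a hypothetical function and is not how `WeatherPuller` is necessarily implemented.

```python
from datetime import date, timedelta

BASE_URL = "https://api.darksky.net/forecast"


def time_machine_urls(api_key, lat, lon, start, end, units="si"):
    """Yield one historical request URL per calendar day in [start, end].

    Hypothetical helper; no request is sent. The time is given without a
    timezone, which Dark Sky interprets as local time at the location.
    """
    day = start
    while day <= end:
        timestamp = f"{day.isoformat()}T00:00:00"
        yield f"{BASE_URL}/{api_key}/{lat},{lon},{timestamp}?units={units}"
        day += timedelta(days=1)


for url in time_machine_urls("YOUR_KEY", 48.78, 5.98, date(2018, 11, 14), date(2018, 11, 15)):
    print(url)
```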


The forecast functionality isn't necessary for the purposes of this experiment, but I've included it here in case you want to look at how well the distributed model does with forecast weather data instead of actuals. The script below will have to be set up to run at a fixed time every day.

In [17]:
wpl = wp.WeatherPuller('fc')
wpl.pull()

api_key: ········
Written


### Parse

The weather_parser module parses the JSON output by the Dark Sky API. The WeatherRecordParser class parses an individual record: it handles timezone conversions, implements error handling, and adds useful data such as the operator, the distance to the nearest weather station, and the time of the pull. It returns dataframes detailing both hourly and daily weather conditions. The WeatherParser class is a wrapper that parses all records.

This takes about 10 minutes to run. Forecast data can be parsed as well by passing the 'fc' argument.

In [18]:
wph = wpr.WeatherParser('hist')
wph.compile_wdata()

Parsed 10000 records
Parsed 20000 records
Parsed 30000 records
Processed
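To give a feel for what parsing a single record involves, here is a minimal sketch that flattens the hourly block of a Dark Sky style response and converts the UNIX timestamps to local wall-clock time. `parse_hourly` is a hypothetical function, not the `WeatherRecordParser` implementation; it relies on the `timezone`, `hourly.data`, and `flags.nearest-station` fields of the response, and the sample record is hand-written for illustration.

```python
import pandas as pd


def parse_hourly(record, operator, coord):
    """Flatten the hourly block of one API response into a dataframe.

    Hypothetical helper; adds local time, operator, coordinate id and the
    distance to the nearest station when it is present.
    """
    hourly = pd.DataFrame(record["hourly"]["data"])
    utc = pd.to_datetime(hourly["time"], unit="s", utc=True)
    # Convert to the location's timezone, then drop the offset to keep naive local time
    hourly["local_time"] = utc.dt.tz_convert(record["timezone"]).dt.tz_localize(None)
    hourly["operator"] = operator
    hourly["coord"] = coord
    hourly["near_st"] = record.get("flags", {}).get("nearest-station")
    return hourly


record = {
    "timezone": "Europe/Madrid",
    "flags": {"nearest-station": 32.636},
    "hourly": {"data": [
        {"time": 1496268000, "temperature": 22.94},
        {"time": 1496271600, "temperature": 20.44},
    ]},
}
print(parse_hourly(record, "ES", 1))
```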


In [19]:
print(wph.daily_weather.shape)
wph.daily_weather.head(3)

(36750, 57)


Unnamed: 0,apparentTemperatureHigh,apparentTemperatureHighTime,apparentTemperatureLow,apparentTemperatureLowTime,apparentTemperatureMax,apparentTemperatureMaxTime,apparentTemperatureMin,apparentTemperatureMinTime,cloudCover,cloudCoverError,coord,dewPoint,humidity,icon,local_time,moonPhase,near_st,operator,ozone,precipAccumulation,precipIntensity,precipIntensityError,precipIntensityMax,precipIntensityMaxError,precipIntensityMaxTime,precipProbability,precipType,pressure,pressureError,pulltime,summary,sunrise,sunriseTime,sunset,sunsetTime,temperatureHigh,temperatureHighError,temperatureHighTime,temperatureLow,temperatureLowError,temperatureLowTime,temperatureMax,temperatureMaxError,temperatureMaxTime,temperatureMin,temperatureMinError,temperatureMinTime,time,uvIndex,uvIndexTime,visibility,windBearing,windBearingError,windGust,windGustTime,windSpeed,windSpeedError
0,31.55,1496325600,16.54,1496383200,31.55,1496325600,15.04,1496293200,0.44,0.11,1,11.08,0.5,partly-cloudy-day,2017-06-01 00:00:00,0.24,32.636,ES,,,,,,,,,rain,1015.21,,2017-06-01 05:00:00,Partly cloudy throughout the day.,2017-06-01 06:51:36,1496292696,2017-06-01 21:52:23,1496346743,31.55,,1496325600,16.54,,1496383200,31.55,,1496325600,15.04,,1496293200,1496268000,9,1496318400,,56.0,,,,0.77,
0,28.4,1496329200,17.6,1496383200,28.4,1496329200,17.46,1496296800,0.39,0.11,2,12.39,0.53,partly-cloudy-day,2017-06-01 00:00:00,0.24,10.0,ES,,,,,,,,,rain,1015.78,10.64,2017-06-01 05:00:00,Partly cloudy throughout the day.,2017-06-01 07:06:02,1496293562,2017-06-01 21:42:52,1496346172,28.4,3.83,1496329200,17.6,3.84,1496383200,28.4,3.83,1496329200,17.46,3.83,1496296800,1496268000,8,1496318400,,176.0,63.75,,,1.77,3.6
0,31.09,1496332800,19.7,1496379600,31.09,1496332800,17.18,1496293200,0.24,,3,10.82,0.45,partly-cloudy-night,2017-06-01 00:00:00,0.24,25.466,ES,,,0.0,,0.0,,,0.0,,1015.55,,2017-06-01 05:00:00,Partly cloudy overnight.,2017-06-01 06:48:16,1496292496,2017-06-01 21:38:29,1496345909,31.09,,1496332800,19.7,,1496379600,31.09,,1496332800,17.18,,1496293200,1496268000,9,1496318400,10.01,309.0,,5.32,1496336000.0,0.41,


In [20]:
print(wph.hourly_weather.shape)
wph.hourly_weather.head(3)

(882000, 31)


Unnamed: 0,apparentTemperature,cloudCover,cloudCoverError,coord,dewPoint,humidity,icon,local_time,near_st,operator,ozone,precipAccumulation,precipIntensity,precipIntensityError,precipProbability,precipProbabilityError,precipType,pressure,pressureError,pulltime,summary,temperature,temperatureError,time,uvIndex,visibility,windBearing,windBearingError,windGust,windSpeed,windSpeedError
0,22.94,0.45,0.11,1,10.19,0.44,partly-cloudy-night,2017-06-01 00:00:00,32.636,ES,,,,,,,rain,1015.92,,2017-06-01 05:00:00,Partly Cloudy,22.94,,1496268000,0,,337.0,,,1.0,
1,20.44,0.44,0.11,1,11.49,0.56,partly-cloudy-night,2017-06-01 01:00:00,32.636,ES,,,,,,,rain,1016.81,,2017-06-01 05:00:00,Partly Cloudy,20.44,,1496271600,0,,0.0,,,3.1,
2,19.04,0.43,0.11,1,11.3,0.61,partly-cloudy-night,2017-06-01 02:00:00,32.636,ES,,,,,,,rain,1017.2,,2017-06-01 05:00:00,Partly Cloudy,19.04,,1496275200,0,,90.0,,,2.1,


### Clean and Transform

Given the high proportion of missing values, a significant amount of clean-up is required. Simply replacing missing values with zeroes or medians, or forward filling them, would actively mislead predictive algorithms, since each of those values says something about the weather, which is variable to begin with. Values have to be interpolated carefully.
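As a small illustration of the difference, the sketch below fills a gap by linear interpolation within each operator and coordinate, so that values from one location never leak into another. This is a simplified stand-in, not the cleaning logic used in this project; `interpolate_within_group` is a hypothetical helper and the sample values are made up.

```python
import numpy as np
import pandas as pd


def interpolate_within_group(df, col, keys=("operator", "coord")):
    """Linearly interpolate gaps in `col`, separately for each group in `keys`.

    Hypothetical helper; leading gaps in a group are left as NaN.
    """
    out = df.sort_values(list(keys) + ["local_time"]).copy()
    out[col] = out.groupby(list(keys))[col].transform(
        lambda s: s.interpolate(method="linear")
    )
    return out


sample = pd.DataFrame({
    "operator": ["ES"] * 3,
    "coord": [1] * 3,
    "local_time": pd.date_range("2017-06-01", periods=3, freq="h"),
    "windGust": [1.0, np.nan, 3.0],
})
print(interpolate_within_group(sample, "windGust"))
print(sample["windGust"].fillna(0))  # zero-filling would imply a sudden calm
```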

In [21]:
wph.hourly_weather.isnull().sum()

apparentTemperature            0
cloudCover                     0
cloudCoverError           516074
coord                          0
dewPoint                       0
humidity                       0
icon                           0
local_time                     0
near_st                        0
operator                       0
ozone                     837775
precipAccumulation        873409
precipIntensity           168385
precipIntensityError      820534
precipProbability         168385
precipProbabilityError    820534
precipType                554376
pressure                       0
pressureError             456044
pulltime                       0
summary                        0
temperature                    0
temperatureError          755054
time                           0
uvIndex                        0
visibility                257351
windBearing                28254
windBearingError          751182
windGust                  291448
windSpeed                      0
windSpeedE

The WeatherCleaner class cleans and transforms parsed weather data. Extraneous columns like error metrics, ozone, and summary are removed. Null values for precipitation and wind are incrementally interpolated based on operator, coordinate, and time of year. The data is transformed from long to wide format to facilitate exploration and modeling. The coordinate is prepended to each column.

This class can handle both daily and hourly data, for both historical observations and forecasts ('fc').
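The long-to-wide step can be sketched as follows: each coordinate's measurements become their own set of columns, prefixed with the coordinate number, leaving one row per operator and timestamp. `to_wide` is a hypothetical function that assumes the long table has one row per operator, time, and coordinate; the actual WeatherCleaner may differ.

```python
import pandas as pd


def to_wide(df, index=("operator", "time", "local_time"), coord_col="coord"):
    """Pivot long weather data so each coordinate gets its own prefixed columns.

    Hypothetical helper; requires unique (index, coord) combinations.
    """
    wide = df.set_index(list(index) + [coord_col]).unstack(coord_col)
    # Columns are (variable, coord); reorder to (coord, variable) and sort
    wide.columns = wide.columns.swaplevel(0, 1)
    wide = wide.sort_index(axis=1)
    wide.columns = [f"{coord}_{var}" for coord, var in wide.columns]
    return wide.reset_index()


long_df = pd.DataFrame({
    "operator": ["ES"] * 4,
    "time": [1496268000, 1496268000, 1496271600, 1496271600],
    "local_time": pd.to_datetime(["2017-06-01 00:00"] * 2 + ["2017-06-01 01:00"] * 2),
    "coord": [1, 2, 1, 2],
    "temperature": [22.94, 23.43, 20.44, 22.76],
    "cloudCover": [0.45, 0.33, 0.44, 0.31],
})
print(to_wide(long_df))
```

With 5 coordinates per operator, this reduces the row count by a factor of 5, which matches the shapes reported in this notebook (36750 daily rows becoming 7350).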

In [22]:
wch = wpr.WeatherCleaner('hist', wph.daily_weather, 'daily')
wch.clean()
print(wch.weatherdata.shape)
wch.weatherdata.head(3)

Written
(7350, 178)


Unnamed: 0,1_apparentTemperatureHigh,1_apparentTemperatureHighTime,1_apparentTemperatureLow,1_apparentTemperatureLowTime,1_apparentTemperatureMax,1_apparentTemperatureMaxTime,1_apparentTemperatureMin,1_apparentTemperatureMinTime,1_cloudCover,1_dewPoint,1_humidity,1_icon,1_moonPhase,1_near_st,1_precipAccumulation,1_precipIntensity,1_precipProbability,1_precipType,1_pressure,1_pulltime,1_sunrise,1_sunset,1_temperatureHigh,1_temperatureHighTime,1_temperatureLow,1_temperatureLowTime,1_temperatureMax,1_temperatureMaxTime,1_temperatureMin,1_temperatureMinTime,1_uvIndex,1_uvIndexTime,1_visibility,1_windBearing,1_windSpeed,2_apparentTemperatureHigh,2_apparentTemperatureHighTime,2_apparentTemperatureLow,2_apparentTemperatureLowTime,2_apparentTemperatureMax,2_apparentTemperatureMaxTime,2_apparentTemperatureMin,2_apparentTemperatureMinTime,2_cloudCover,2_dewPoint,2_humidity,2_icon,2_moonPhase,2_near_st,2_precipAccumulation,2_precipIntensity,2_precipProbability,2_precipType,2_pressure,2_pulltime,2_sunrise,2_sunset,2_temperatureHigh,2_temperatureHighTime,2_temperatureLow,2_temperatureLowTime,2_temperatureMax,2_temperatureMaxTime,2_temperatureMin,2_temperatureMinTime,2_uvIndex,2_uvIndexTime,2_visibility,2_windBearing,2_windSpeed,3_apparentTemperatureHigh,3_apparentTemperatureHighTime,3_apparentTemperatureLow,3_apparentTemperatureLowTime,3_apparentTemperatureMax,3_apparentTemperatureMaxTime,3_apparentTemperatureMin,3_apparentTemperatureMinTime,3_cloudCover,3_dewPoint,3_humidity,3_icon,3_moonPhase,3_near_st,3_precipAccumulation,3_precipIntensity,3_precipProbability,3_precipType,3_pressure,3_pulltime,3_sunrise,3_sunset,3_temperatureHigh,3_temperatureHighTime,3_temperatureLow,3_temperatureLowTime,3_temperatureMax,3_temperatureMaxTime,3_temperatureMin,3_temperatureMinTime,3_uvIndex,3_uvIndexTime,3_visibility,3_windBearing,3_windSpeed,4_apparentTemperatureHigh,4_apparentTemperatureHighTime,4_apparentTemperatureLow,4_apparentTemperatureLowTime,4_apparentTemperatureMax,4_apparentTemperatureMaxTime,4_apparentTemperatureMin,4_apparentTemperatureMinTime,4_cloudCover,4_dewPoint,4_humidity,4_icon,4_moonPhase,4_near_st,4_precipAccumulation,4_precipIntensity,4_precipProbability,4_precipType,4_pressure,4_pulltime,4_sunrise,4_sunset,4_temperatureHigh,4_temperatureHighTime,4_temperatureLow,4_temperatureLowTime,4_temperatureMax,4_temperatureMaxTime,4_temperatureMin,4_temperatureMinTime,4_uvIndex,4_uvIndexTime,4_visibility,4_windBearing,4_windSpeed,5_apparentTemperatureHigh,5_apparentTemperatureHighTime,5_apparentTemperatureLow,5_apparentTemperatureLowTime,5_apparentTemperatureMax,5_apparentTemperatureMaxTime,5_apparentTemperatureMin,5_apparentTemperatureMinTime,5_cloudCover,5_dewPoint,5_humidity,5_icon,5_moonPhase,5_near_st,5_precipAccumulation,5_precipIntensity,5_precipProbability,5_precipType,5_pressure,5_pulltime,5_sunrise,5_sunset,5_temperatureHigh,5_temperatureHighTime,5_temperatureLow,5_temperatureLowTime,5_temperatureMax,5_temperatureMaxTime,5_temperatureMin,5_temperatureMinTime,5_uvIndex,5_uvIndexTime,5_visibility,5_windBearing,5_windSpeed,operator,time,localtime
0,31.55,1496325600,16.54,1496383200,31.55,1496325600,15.04,1496293200,0.44,11.08,0.5,partly-cloudy-day,0.24,32.636,0.0,0.0,0.0,rain,1015.21,2017-06-01 05:00:00,65136,215223,31.55,1496325600,16.54,1496383200,31.55,1496325600,15.04,1496293200,9,1496318400,16.09,56.0,0.77,28.4,1496329200,17.6,1496383200,28.4,1496329200,17.46,1496296800,0.39,12.39,0.53,partly-cloudy-day,0.24,10.0,0.0,0.0,0.0,rain,1015.78,2017-06-01 05:00:00,70602,214252,28.4,1496329200,17.6,1496383200,28.4,1496329200,17.46,1496296800,8,1496318400,16.09,176.0,1.77,31.09,1496332800,19.7,1496379600,31.09,1496332800,17.18,1496293200,0.24,10.82,0.45,partly-cloudy-night,0.24,25.466,0.0,0.0,0.0,,1015.55,2017-06-01 05:00:00,64816,213829,31.09,1496332800,19.7,1496379600,31.09,1496332800,17.18,1496293200,9,1496318400,10.01,309.0,0.41,25.01,1496318400,15.11,1496379600,25.01,1496318400,15.69,1496293200,0.33,15.52,0.77,partly-cloudy-night,0.24,12.833,0.0,0.0,0.0,,1018.67,2017-06-01 05:00:00,64416,212412,25.01,1496318400,15.1,1496379600,25.01,1496318400,15.48,1496293200,10,1496318400,9.93,220.0,1.1,24.49,1496329200,15.67,1496379600,24.49,1496329200,15.51,1496293200,0.48,9.62,0.54,partly-cloudy-day,0.24,10.0,0.0,0.0305,0.56,rain,1015.28,2017-06-01 05:00:00,63153,212954,24.49,1496329200,15.67,1496379600,24.49,1496329200,15.51,1496293200,7,1496314800,16.09,125.0,2.36,ES,1496268000,2017-06-01 00:00:00
0,27.12,1496325600,14.34,1496376000,27.12,1496325600,12.46,1496286000,0.58,13.07,0.66,partly-cloudy-day,0.24,0.399,0.0,0.0,0.0,,1019.91,2017-06-01 05:00:00,53846,213209,27.06,1496325600,14.34,1496376000,27.06,1496325600,12.46,1496286000,6,1496311200,10.01,286.0,0.58,26.48,1496332800,14.86,1496376000,26.48,1496332800,14.36,1496289600,0.73,15.23,0.74,partly-cloudy-day,0.24,22.998,0.0,0.0,0.0,,1019.4,2017-06-01 05:00:00,55936,213653,26.48,1496332800,14.81,1496376000,26.48,1496332800,14.36,1496289600,7,1496314800,9.82,175.0,1.48,27.78,1496322000,16.53,1496379600,27.78,1496322000,14.59,1496293200,0.47,16.1,0.75,partly-cloudy-day,0.24,13.139,0.0,0.0,0.0,,1018.78,2017-06-01 05:00:00,62213,214306,26.87,1496322000,16.46,1496379600,26.87,1496322000,14.59,1496293200,8,1496318400,9.12,23.0,1.47,24.08,1496329200,13.09,1496376000,24.08,1496329200,12.02,1496289600,0.45,12.57,0.73,partly-cloudy-day,0.24,20.611,0.0,0.0025,0.19,rain,1019.16,2017-06-01 05:00:00,60855,212811,24.08,1496329200,13.09,1496376000,24.08,1496329200,12.02,1496289600,8,1496314800,8.42,277.0,0.78,23.41,1496322000,10.7,1496376000,23.41,1496322000,10.7,1496289600,0.69,11.31,0.73,partly-cloudy-day,0.24,7.45,0.0,0.0889,0.11,rain,1020.12,2017-06-01 05:00:00,55733,211313,23.41,1496322000,10.7,1496376000,23.41,1496322000,10.7,1496289600,7,1496311200,16.09,324.0,0.98,FR,1496268000,2017-06-01 00:00:00
0,27.62,1496325600,13.63,1496376000,27.62,1496325600,15.61,1496289600,0.38,9.89,0.49,partly-cloudy-day,0.24,29.268,0.0,0.0,0.0,,1016.54,2017-06-01 05:00:00,54105,204531,27.62,1496325600,13.63,1496376000,27.62,1496325600,15.61,1496289600,8,1496318400,10.01,102.0,0.98,26.13,1496325600,18.87,1496376000,26.13,1496325600,18.88,1496282400,0.3,16.93,0.71,partly-cloudy-night,0.24,9.165,0.0,0.0,0.0,,1016.79,2017-06-01 05:00:00,53504,203131,25.96,1496325600,18.87,1496376000,25.96,1496325600,18.88,1496282400,9,1496314800,9.91,134.0,1.1,27.48,1496322000,17.57,1496365200,27.48,1496322000,16.46,1496286000,0.4,13.26,0.58,partly-cloudy-day,0.24,21.015,0.0,0.0,0.0,,1020.25,2017-06-01 05:00:00,52325,201802,27.48,1496322000,17.57,1496365200,27.48,1496322000,16.46,1496286000,9,1496311200,10.01,338.0,0.41,20.91,1496318400,13.84,1496372400,20.91,1496318400,12.54,1496278800,0.35,9.33,0.6,partly-cloudy-day,0.24,22.973,0.0,0.0,0.0,,1020.38,2017-06-01 05:00:00,53143,201554,20.91,1496318400,13.84,1496372400,20.91,1496318400,12.54,1496278800,9,1496314800,10.01,208.0,1.68,23.76,1496318400,16.64,1496372400,23.76,1496318400,15.53,1496271600,0.42,11.42,0.6,partly-cloudy-day,0.24,5.33,0.0,0.0,0.0,,1018.17,2017-06-01 05:00:00,54723,202326,23.76,1496318400,16.64,1496372400,23.76,1496318400,15.53,1496271600,8,1496311200,16.09,68.0,0.54,IT,1496268000,2017-06-01 00:00:00


In [23]:
wch = wpr.WeatherCleaner('hist', wph.hourly_weather, 'hourly')
wch.clean()
print(wch.weatherdata.shape)
wch.weatherdata.head(3)

Written
(176400, 88)


Unnamed: 0,1_apparentTemperature,1_cloudCover,1_dewPoint,1_humidity,1_icon,1_near_st,1_precipAccumulation,1_precipIntensity,1_precipProbability,1_precipType,1_pressure,1_pulltime,1_temperature,1_uvIndex,1_visibility,1_windBearing,1_windSpeed,2_apparentTemperature,2_cloudCover,2_dewPoint,2_humidity,2_icon,2_near_st,2_precipAccumulation,2_precipIntensity,2_precipProbability,2_precipType,2_pressure,2_pulltime,2_temperature,2_uvIndex,2_visibility,2_windBearing,2_windSpeed,3_apparentTemperature,3_cloudCover,3_dewPoint,3_humidity,3_icon,3_near_st,3_precipAccumulation,3_precipIntensity,3_precipProbability,3_precipType,3_pressure,3_pulltime,3_temperature,3_uvIndex,3_visibility,3_windBearing,3_windSpeed,4_apparentTemperature,4_cloudCover,4_dewPoint,4_humidity,4_icon,4_near_st,4_precipAccumulation,4_precipIntensity,4_precipProbability,4_precipType,4_pressure,4_pulltime,4_temperature,4_uvIndex,4_visibility,4_windBearing,4_windSpeed,5_apparentTemperature,5_cloudCover,5_dewPoint,5_humidity,5_icon,5_near_st,5_precipAccumulation,5_precipIntensity,5_precipProbability,5_precipType,5_pressure,5_pulltime,5_temperature,5_uvIndex,5_visibility,5_windBearing,5_windSpeed,operator,time,localtime
0,22.94,0.45,10.19,0.44,partly-cloudy-night,32.636,0.0,0.0,0.0,rain,1015.92,2017-06-01 05:00:00,22.94,0,13.82,337.0,1.0,23.43,0.33,13.07,0.52,partly-cloudy-night,10.0,0.0,0.0,0.0,rain,1015.9,2017-06-01 05:00:00,23.43,0,16.09,122.0,2.49,23.81,0.75,10.34,0.43,partly-cloudy-night,25.466,0.0,0.0,0.0,,1016.4,2017-06-01 05:00:00,23.81,0,10.01,131.0,2.09,17.84,0.33,17.0,0.97,partly-cloudy-night,12.833,0.0,0.0,0.0,,1019.61,2017-06-01 05:00:00,17.51,0,10.01,257.0,1.34,18.23,0.41,9.67,0.57,partly-cloudy-night,10.0,0.0,0.0,0.0,rain,1016.07,2017-06-01 05:00:00,18.23,0,16.09,94.0,2.35,ES,1496268000,2017-06-01 00:00:00
1,20.44,0.44,11.49,0.56,partly-cloudy-night,32.636,0.0,0.0,0.0,rain,1016.81,2017-06-01 05:00:00,20.44,0,13.82,0.0,3.1,22.76,0.31,13.08,0.54,partly-cloudy-night,10.0,0.0,0.0,0.0,rain,1016.2,2017-06-01 05:00:00,22.76,0,16.09,162.0,2.14,21.64,0.75,12.33,0.55,partly-cloudy-night,25.466,0.0,0.0,0.0,,1017.2,2017-06-01 05:00:00,21.64,0,10.01,247.0,1.72,17.77,0.3,17.16,0.98,partly-cloudy-night,12.833,0.0,0.0,0.0,,1019.89,2017-06-01 05:00:00,17.41,0,10.01,297.0,0.61,17.62,0.39,9.67,0.6,partly-cloudy-night,10.0,0.0,0.0,0.0,rain,1016.41,2017-06-01 05:00:00,17.62,0,16.09,91.0,2.37,ES,1496271600,2017-06-01 01:00:00
2,19.04,0.43,11.3,0.61,partly-cloudy-night,32.636,0.0,0.0,0.0,rain,1017.2,2017-06-01 05:00:00,19.04,0,13.82,90.0,2.1,22.0,0.29,12.98,0.57,partly-cloudy-night,10.0,0.0,0.0,0.0,rain,1016.3,2017-06-01 05:00:00,22.0,0,16.09,195.0,1.81,20.49,0.38,12.58,0.6,partly-cloudy-night,25.466,0.0,0.0,0.0,,1018.01,2017-06-01 05:00:00,20.49,0,10.01,287.0,0.58,17.54,0.3,17.06,0.99,partly-cloudy-night,12.833,0.0,0.0,0.0,,1019.81,2017-06-01 05:00:00,17.17,0,10.01,349.0,0.24,17.12,0.37,9.62,0.61,partly-cloudy-night,10.0,0.0,0.0,0.0,rain,1016.51,2017-06-01 05:00:00,17.12,0,16.09,92.0,2.36,ES,1496275200,2017-06-01 02:00:00
