# Gathering Data

## Source Data Description

The data used in this work come from the Kaggle competition [ASHRAE - Great Energy Predictor III](https://www.kaggle.com/c/ashrae-energy-prediction), whose goal is to predict the energy consumption of buildings at several locations over the next two years. The files are:


**train.csv**
* `building_id` - Foreign key for the building metadata.
* `meter` - The meter id code. Read as `{0: electricity, 1: chilledwater, 2: steam, 3: hotwater}`. Not every building has all meter types.
* `timestamp`  - When the measurement was taken
* `meter_reading` - The target variable. Energy consumption in kWh (or equivalent). Note that this is real data with measurement error, which we expect will impose a baseline level of modeling error.
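
As a small illustration of the `meter` coding described above, the integer codes can be mapped to readable names. This is only a sketch on made-up rows; the `METER_NAMES` dictionary and the `sample` frame are names introduced here for illustration and are not part of the project code.

```python
import pandas as pd

# Mapping taken from the description of the `meter` column
METER_NAMES = {0: 'electricity', 1: 'chilledwater', 2: 'steam', 3: 'hotwater'}

# Toy rows (invented values), not real competition data
sample = pd.DataFrame({'building_id': [0, 1, 1], 'meter': [0, 0, 3]})

# Translate the integer code into its meter name
sample['meter_name'] = sample['meter'].map(METER_NAMES)

# Not every building has all meter types: list the ones present per building
meters_per_building = sample.groupby('building_id')['meter_name'].agg(list)
print(meters_per_building)
```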

**building_metadata.csv**
* `site_id` - Foreign key for the weather files.
* `building_id` - Foreign key for `train.csv`
* `primary_use` - Indicator of the primary category of activities for the building based on EnergyStar property type definitions
* `square_feet` - Gross floor area of the building
* `year_built` - Year building was opened
* `floor_count` - Number of floors of the building

**weather_[train/test].csv**

Weather data from a meteorological station as close as possible to the site.

* `site_id`
* `air_temperature` - Degrees Celsius
* `cloud_coverage` - Portion of the sky covered in clouds, in oktas
* `dew_temperature` - Degrees Celsius
* `precip_depth_1_hr` - Millimeters
* `sea_level_pressure` - Millibar/hectopascals
* `wind_direction` - Compass direction (0-360)
* `wind_speed` - Meters per second

**test.csv**

The submission files use row numbers for ID codes in order to save space on the file uploads. test.csv has no feature data; it exists so you can get your predictions into the correct order.

* `row_id` - Row id for your submission file
* `building_id` - Building id code
* `meter` - The meter id code
* `timestamp` - Timestamps for the test data period
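
The foreign keys listed above suggest how the files fit together: readings link to buildings through `building_id`, and buildings link to weather through `site_id` (plus `timestamp`). The following is a minimal sketch of that chain on invented toy rows; the frame names `readings`, `buildings` and `weather` are assumptions made for the example.

```python
import pandas as pd

# Toy frames with invented values that mimic the three file layouts
readings = pd.DataFrame({
    'building_id': [0, 105],
    'meter': [0, 0],
    'timestamp': pd.to_datetime(['2016-01-01 00:00:00'] * 2),
    'meter_reading': [0.0, 23.3],
})
buildings = pd.DataFrame({
    'site_id': [0, 1],
    'building_id': [0, 105],
    'primary_use': ['Education', 'Office'],
})
weather = pd.DataFrame({
    'site_id': [0, 1],
    'timestamp': pd.to_datetime(['2016-01-01 00:00:00'] * 2),
    'air_temperature': [25.0, 3.8],
})

# readings -> buildings via building_id, then -> weather via (site_id, timestamp).
# validate='many_to_one' raises if a key is duplicated on the right-hand side.
full = (
    readings
    .merge(buildings, on='building_id', how='left', validate='many_to_one')
    .merge(weather, on=['site_id', 'timestamp'], how='left', validate='many_to_one')
)
print(full)
```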


For the purpose of this Master's thesis, we initially restrict the analysis and modeling to a single location. We first import the raw data and run a brief exploratory analysis in order to choose the site. We look for the site with the highest data quality, avoiding missing values, inconsistencies, etc. as far as possible, since the aim of this work is to apply several ML algorithms rather than to clean data.

In [6]:
# Load libraries

%load_ext autoreload
%autoreload 2

import pandas as pd
import numpy as np
import datetime as dt
import os
import gc
from src.functions import data_import as dimp
from src.functions import data_exploration as dexp
import pandas_profiling

import matplotlib.pyplot as plt
import plotly.graph_objs as go
import plotly
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import cufflinks as cf
cf.set_config_file(offline=True)

## Data import

To avoid memory problems caused by the size of the data sets, we use the function `import_data` from our local module `src.functions.data_import` (imported as `dimp`), which considerably reduces the memory footprint of the data frames.
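
The source of `import_data` is not shown in this notebook. The sketch below only illustrates the general downcasting technique such helpers commonly rely on; `reduce_memory` is a hypothetical name, and the real function may differ (for instance, the outputs in this notebook show `float16` columns, whereas `pd.to_numeric` does not downcast below `float32`).

```python
import numpy as np
import pandas as pd
from pandas.api import types as ptypes


def reduce_memory(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: store each column in a smaller dtype when possible."""
    out = df.copy()
    for col in out.columns:
        if ptypes.is_integer_dtype(out[col]):
            # Smallest signed integer type that holds all values
            out[col] = pd.to_numeric(out[col], downcast='integer')
        elif ptypes.is_float_dtype(out[col]):
            # float64 -> float32 when the values survive the cast
            out[col] = pd.to_numeric(out[col], downcast='float')
        elif ptypes.is_object_dtype(out[col]) or ptypes.is_string_dtype(out[col]):
            # Repeated strings are stored once as categories
            out[col] = out[col].astype('category')
    return out


# Toy frame with invented values
demo = pd.DataFrame({
    'building_id': np.arange(1000, dtype='int64'),
    'meter_reading': np.linspace(0.0, 500.0, 1000),
    'timestamp': ['2016-01-01 00:00:00'] * 1000,
})
before = demo.memory_usage(deep=True).sum()
small = reduce_memory(demo)
after = small.memory_usage(deep=True).sum()
print(f'{before / 1024:.1f} KB -> {after / 1024:.1f} KB')
```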

In [10]:
train = dimp.import_data('../../data/raw/train.csv') 

Memory usage of dataframe is 616.95 MB
Memory usage after optimization is: 173.90 MB
Decreased by 71.8%


In [2]:
building_meta = dimp.import_data('../../data/raw/building_metadata.csv')

Memory usage of dataframe is 0.07 MB
Memory usage after optimization is: 0.02 MB
Decreased by 73.8%


In [9]:
weather_train = dimp.import_data('../../data/raw/weather_train.csv')

Memory usage of dataframe is 9.60 MB
Memory usage after optimization is: 2.65 MB
Decreased by 72.4%


In [4]:
weather_test = dimp.import_data('../../data/raw/weather_test.csv')

Memory usage of dataframe is 19.04 MB
Memory usage after optimization is: 5.25 MB
Decreased by 72.4%


In [79]:
test = dimp.import_data('../../data/raw/test.csv')

Memory usage of dataframe is 1272.51 MB
Memory usage after optimization is: 358.65 MB
Decreased by 71.8%


In [None]:
# function to import data
dimp.load_data('../../data/raw/')

## Data set profiles

We use the library `pandas_profiling` on the smaller data sets to produce a general exploratory data analysis report, which we use to choose the site.
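
As a lightweight complement to the full profiling reports, the share of missing values per site can be computed directly with pandas. This is a sketch on invented toy rows, not the procedure actually used to pick the site; the names `weather`, `missing_share` and `ranking` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Toy weather frame with invented values
weather = pd.DataFrame({
    'site_id': [0, 0, 0, 1, 1, 1],
    'air_temperature': [25.0, np.nan, 22.8, 3.8, 3.7, 2.6],
    'cloud_coverage': [6.0, np.nan, np.nan, 0.0, 2.0, 0.0],
})

# Fraction of missing values per column, computed separately for each site
missing_share = (
    weather.drop(columns='site_id')
    .isna()
    .groupby(weather['site_id'])
    .mean()
)

# Average over columns and sort: the first site has the fewest gaps
ranking = missing_share.mean(axis=1).sort_values()
print(missing_share)
print(ranking)
```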

### Data set `weather_train`

In [6]:
weather_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 139773 entries, 0 to 139772
Data columns (total 9 columns):
site_id               139773 non-null int8
timestamp             139773 non-null category
air_temperature       139718 non-null float16
cloud_coverage        70600 non-null float16
dew_temperature       139660 non-null float16
precip_depth_1_hr     89484 non-null float16
sea_level_pressure    129155 non-null float16
wind_direction        133505 non-null float16
wind_speed            139469 non-null float16
dtypes: category(1), float16(7), int8(1)
memory usage: 2.6 MB


Variable `timestamp` is categorical. Let's convert it to datetime.

In [7]:
weather_train['timestamp'] = pd.to_datetime(weather_train['timestamp'])
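
Once the column is a true datetime, it can also be used to look for gaps in the hourly series, which is one aspect of the data quality we care about. The following is a minimal sketch on invented timestamps; the explicit `format` string is an assumption based on the values displayed in this notebook.

```python
import pandas as pd

# Toy categorical timestamps (invented), with the 02:00 reading missing
ts = pd.Series(
    ['2016-01-01 00:00:00', '2016-01-01 01:00:00', '2016-01-01 03:00:00'],
    dtype='category',
)

# Convert to plain strings first, then parse with an explicit format
parsed = pd.to_datetime(ts.astype(str), format='%Y-%m-%d %H:%M:%S')

# Differences larger than one hour indicate missing hourly observations
steps = parsed.diff().dropna()
missing_hours = steps[steps > pd.Timedelta(hours=1)]
print(missing_hours)
```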

In [8]:
weather_train.head()

Unnamed: 0,site_id,timestamp,air_temperature,cloud_coverage,dew_temperature,precip_depth_1_hr,sea_level_pressure,wind_direction,wind_speed
0,0,2016-01-01 00:00:00,25.0,6.0,20.0,,1019.5,0.0,0.0
1,0,2016-01-01 01:00:00,24.40625,,21.09375,-1.0,1020.0,70.0,1.5
2,0,2016-01-01 02:00:00,22.796875,2.0,21.09375,0.0,1020.0,0.0,0.0
3,0,2016-01-01 03:00:00,21.09375,2.0,20.59375,0.0,1020.0,0.0,0.0
4,0,2016-01-01 04:00:00,20.0,2.0,20.0,-1.0,1020.0,250.0,2.599609


#### Reports by `site_id`

In [9]:
sites = weather_train['site_id'].unique()

In [11]:
weather_train_reports = dexp.get_report_by_site(weather_train, sites)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().rename(**kwargs)
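
The warning above typically appears when a slice of a data frame is modified in place. The internals of the `dexp` helpers are not shown here, so the following is only a sketch of one common way to split a frame by site without triggering it: take an explicit copy of each group before changing it. The frame `weather` and the dictionary names are invented for the example.

```python
import pandas as pd

# Toy weather frame with invented values
weather = pd.DataFrame({
    'site_id': [0, 0, 1],
    'air_temperature': [25.0, 24.4, 3.8],
})

# One independent copy per site, so later edits do not touch `weather`
per_site = {site: frame.copy() for site, frame in weather.groupby('site_id')}

# Renaming a column on the copy leaves the original frame unchanged
per_site[1] = per_site[1].rename(columns={'air_temperature': 'air_temperature_c'})

# A simple per-site summary (a stand-in for a full profiling report)
summaries = {site: frame.describe() for site, frame in per_site.items()}
print(summaries[0])
```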


In [14]:
weather_train.name = 'weather_train'

In [16]:
dexp.export_reports(weather_train, weather_train_reports, '../../data/external/weather_train_profiles/')

### Data set `weather_test` 

In [21]:
weather_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 277243 entries, 0 to 277242
Data columns (total 9 columns):
site_id               277243 non-null int8
timestamp             277243 non-null datetime64[ns]
air_temperature       277139 non-null float16
cloud_coverage        136795 non-null float16
dew_temperature       276916 non-null float16
precip_depth_1_hr     181655 non-null float16
sea_level_pressure    255978 non-null float16
wind_direction        264873 non-null float16
wind_speed            276783 non-null float16
dtypes: datetime64[ns](1), float16(7), int8(1)
memory usage: 6.1 MB


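`timestamp` is loaded as categorical, so we convert it to datetime. The `info()` output above was captured after the conversion cell below had already been run, which is why it reports `datetime64[ns]`.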

In [20]:
weather_test['timestamp'] = pd.to_datetime(weather_test['timestamp'])

#### Reports by `site_id`

In [23]:
weather_test.name = 'weather_test'
weather_test_reports = dexp.get_report_by_site(weather_test, weather_test['site_id'].unique())
dexp.export_reports(weather_test, weather_test_reports, '../../data/external/weather_test_profiles/')

### Data set `building_metadata` 

In [8]:
building_meta.head()

Unnamed: 0,site_id,building_id,primary_use,square_feet,year_built,floor_count
0,0,0,Education,7432,2008.0,
1,0,1,Education,2720,2004.0,
2,0,2,Education,5376,1991.0,
3,0,3,Education,23685,2002.0,
4,0,4,Education,116607,1975.0,


In [27]:
building_meta.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1449 entries, 0 to 1448
Data columns (total 6 columns):
site_id        1449 non-null int8
building_id    1449 non-null int16
primary_use    1449 non-null category
square_feet    1449 non-null int32
year_built     675 non-null float16
floor_count    355 non-null float16
dtypes: category(1), float16(2), int16(1), int32(1), int8(1)
memory usage: 17.9 KB


Features `year_built` and `floor_count` are stored as `float` because they contain missing values. We cast them to pandas' nullable integer type, which keeps the missing entries.

In [None]:
building_meta['year_built'] = pd.array(building_meta['year_built'], dtype=pd.Int32Dtype())
building_meta['floor_count'] = pd.array(building_meta['floor_count'], dtype=pd.Int32Dtype())
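
To see what the nullable integer type does with missing values, here is a small sketch on an invented series. It uses `astype('Int32')`, which is an alternative spelling chosen for the example rather than the exact call used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy column (invented values): years stored as float because of the NaN
year_built = pd.Series([2008.0, np.nan, 1991.0])

# 'Int32' (capital I) is the nullable integer dtype; NaN becomes <NA>
as_int = year_built.astype('Int32')
print(as_int)
```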

#### Reports by `site_id`

In [None]:
building_meta.name = 'building_metadata'
building_meta_reports = dexp.get_report_by_site(building_meta, building_meta['site_id'].unique())
dexp.export_reports(building_meta, building_meta_reports, '../../data/external/building_meta_profiles/')

### Data set `train`

In [11]:
train.head()

Unnamed: 0,building_id,meter,timestamp,meter_reading
0,0,0,2016-01-01 00:00:00,0.0
1,1,0,2016-01-01 00:00:00,0.0
2,2,0,2016-01-01 00:00:00,0.0
3,3,0,2016-01-01 00:00:00,0.0
4,4,0,2016-01-01 00:00:00,0.0


In [13]:
train.isna().sum()

building_id      0
meter            0
timestamp        0
meter_reading    0
dtype: int64

In [14]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20216100 entries, 0 to 20216099
Data columns (total 4 columns):
building_id      int16
meter            int8
timestamp        category
meter_reading    float32
dtypes: category(1), float32(1), int16(1), int8(1)
memory usage: 173.9 MB


## Data selection

After reviewing the data set profiles, we select the location with `site_id` 1. The following cells select the rows of each data set for this location and export the resulting data frames to `csv` files, which serve as the starting data sets for this work.

In [54]:
weather_train[weather_train['site_id']==1].to_csv('../../data/interim/site_1/weather_train.csv', index=False)
weather_test[weather_test['site_id']==1].to_csv('../../data/interim/site_1/weather_test.csv', index=False)
building_meta[building_meta['site_id']==1].to_csv('../../data/interim/site_1/building_metadata.csv', index=False)

The `train` set has no `site_id` column; its key is `building_id`. We therefore extract all rows whose `building_id` belongs to the list of building ids for site 1.

In [59]:
buildings_site_1 = list(building_meta.loc[building_meta['site_id']==1, 'building_id'])

In [70]:
# Keep only the readings of buildings located at site 1
# (isin replaces the row-by-row DataFrame.append loop, which newer pandas versions no longer support)
train_site_1 = train[train['building_id'].isin(buildings_site_1)]

In [78]:
train_site_1.to_csv('../../data/interim/site_1/train.csv', index=False)

In [82]:
del(weather_train, building_meta, train)

Same for `test` set:

In [83]:
# Keep only the test rows of buildings located at site 1
# (isin replaces the row-by-row DataFrame.append loop, which newer pandas versions no longer support)
test_site_1 = test[test['building_id'].isin(buildings_site_1)]

In [85]:
test_site_1.to_csv('../../data/interim/site_1/test.csv', index=False)
