# Gathering Data

## Source Data Description

The data used in this work come from the Kaggle competition [ASHRAE - Great Energy Predictor III](https://www.kaggle.com/c/ashrae-energy-prediction), where the goal is to predict energy consumption for buildings at several locations over the following two years. The competition provides these files:


**train.csv**
* `building_id` - Foreign key for the building metadata.
* `meter` - The meter id code. Read as `{0: electricity, 1: chilledwater, 2: steam, 3: hotwater}`. Not every building has all meter types.
* `timestamp`  - When the measurement was taken
* `meter_reading` - The target variable. Energy consumption in kWh (or equivalent). Note that this is real data with measurement error, which we expect will impose a baseline level of modeling error.
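
As a small illustration of the `meter` coding described above, the sketch below maps the numeric codes to readable names. The data frame `toy_train` and its values are invented for the example; only the code-to-name mapping comes from the file description.

```python
import pandas as pd

# Meter codes as documented for train.csv.
METER_NAMES = {0: "electricity", 1: "chilledwater", 2: "steam", 3: "hotwater"}

# Toy rows standing in for train.csv (values are made up for illustration).
toy_train = pd.DataFrame({
    "building_id": [0, 0, 1],
    "meter": [0, 1, 0],
    "meter_reading": [12.5, 3.0, 40.1],
})

# Add a human-readable column next to the numeric code.
toy_train["meter_name"] = toy_train["meter"].map(METER_NAMES)
print(toy_train)
```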

**building_metadata.csv**
* `site_id` - Foreign key for the weather files.
* `building_id` - Foreign key for `train.csv`
* `primary_use` - Indicator of the primary category of activities for the building based on EnergyStar property type definitions
* `square_feet` - Gross floor area of the building
* `year_built` - Year building was opened
* `floor_count` - Number of floors of the building

**weather_[train/test].csv**

Weather data from a meteorological station as close as possible to the site.

* `site_id`
* `air_temperature` - Degrees Celsius
* `cloud_coverage` - Portion of the sky covered in clouds, in oktas
* `dew_temperature` - Degrees Celsius
* `precip_depth_1_hr` - Millimeters
* `sea_level_pressure` - Millibar/hectopascals
* `wind_direction` - Compass direction (0-360)
* `wind_speed` - Meters per second

**test.csv**

The submission files use row numbers for ID codes in order to save space on the file uploads. test.csv has no feature data; it exists so you can get your predictions into the correct order.

* `row_id` - Row id for your submission file
* `building_id` - Building id code
* `meter` - The meter id code
* `timestamp` - Timestamps for the test data period
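
The foreign keys listed above suggest how the files fit together: readings link to buildings through `building_id`, and buildings link to weather through `site_id` (plus `timestamp`). The following is a minimal sketch of that join on invented toy data; it is not part of the thesis pipeline, and the column subsets are chosen only for brevity.

```python
import pandas as pd

# Toy stand-ins for the three tables (all values invented).
ts = pd.to_datetime(["2016-01-01 00:00", "2016-01-01 01:00"])
train = pd.DataFrame({
    "building_id": [10, 10, 11],
    "meter": [0, 0, 0],
    "timestamp": [ts[0], ts[1], ts[0]],
    "meter_reading": [5.0, 6.0, 7.0],
})
building = pd.DataFrame({
    "building_id": [10, 11],
    "site_id": [1, 2],
    "square_feet": [1000, 2000],
})
weather = pd.DataFrame({
    "site_id": [1, 1, 2],
    "timestamp": [ts[0], ts[1], ts[0]],
    "air_temperature": [3.0, 2.5, 10.0],
})

# readings -> building metadata (adds site_id) -> weather at that site and hour
joined = (
    train
    .merge(building, on="building_id", how="left")
    .merge(weather, on=["site_id", "timestamp"], how="left")
)
print(joined)
```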


For this master's thesis, we initially restrict the analysis and modeling to a single location. We first import the raw data and run a brief exploratory analysis to choose the site. We look for the site with the best data quality (few missing values, few inconsistencies), since the aim of this work is to apply several ML algorithms rather than to clean data.

In [2]:
# Load libraries

%load_ext autoreload
%autoreload 2

import pandas as pd
import numpy as np
import datetime as dt
import os
import gc
from src.functions import data_import as dimp
from src.functions import data_exploration as dexp
import pandas_profiling

import matplotlib.pyplot as plt
import plotly.graph_objs as go
import plotly
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import cufflinks as cf
cf.set_config_file(offline=True)

## Data import

To avoid memory problems caused by the size of the data sets, we use the function `import_data` from our local module `data_import` (imported above as `dimp`), which considerably reduces the memory footprint of the data frames.

In [None]:
train = dimp.import_data('../../data/raw/train.csv') 

In [None]:
building_meta = dimp.import_data('../../data/raw/building_metadata.csv')

In [3]:
weather_train = dimp.import_data('../../data/raw/weather_train.csv')

Memory usage of dataframe is 9.60 MB
Memory usage after optimization is: 2.65 MB
Decreased by 72.4%
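
A reduction like the one reported above is typically obtained by downcasting column dtypes. The sketch below shows one common way to do that; `reduce_memory` is a hypothetical helper written for this illustration and is not the actual implementation of `import_data`. Note that downcasting floats to `float32` trades some precision for memory.

```python
import numpy as np
import pandas as pd
from pandas.api import types as ptypes


def reduce_memory(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with numeric columns downcast and text columns as category."""
    out = df.copy()
    for col in out.columns:
        if ptypes.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif ptypes.is_float_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="float")
        elif ptypes.is_object_dtype(out[col]) or ptypes.is_string_dtype(out[col]):
            out[col] = out[col].astype("category")
    return out


# Toy frame with the same kinds of columns as the weather files (values invented).
toy = pd.DataFrame({
    "site_id": np.arange(1000, dtype="int64") % 16,
    "air_temperature": np.linspace(-10, 30, 1000),
    "timestamp": ["2016-01-01 00:00:00"] * 1000,
})

before = toy.memory_usage(deep=True).sum()
small = reduce_memory(toy)
after = small.memory_usage(deep=True).sum()
print(f"{before} bytes -> {after} bytes")
```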


In [None]:
weather_test = dimp.import_data('../../data/raw/weather_test.csv')

In [None]:
test = dimp.import_data('../../data/raw/test.csv')

In [None]:
# function to import data
dimp.load_data('../../data/raw/')

## Data set profiles

We use the `pandas_profiling` library on the smaller data sets to produce a general exploratory report, which we use to choose the site.
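
As a lightweight complement to the full profiling reports, the share of missing values per site can be computed directly with pandas. This is only a sketch on invented toy data; the averaged `missing_score` is an assumption made for illustration, not the criterion used in this work.

```python
import numpy as np
import pandas as pd

# Toy weather table (values invented).
toy_weather = pd.DataFrame({
    "site_id": [0, 0, 1, 1, 2, 2],
    "air_temperature": [20.0, np.nan, 5.0, 6.0, np.nan, np.nan],
    "cloud_coverage": [np.nan, np.nan, 2.0, 4.0, 1.0, np.nan],
})

# Fraction of missing values per column, computed separately for each site.
missing_by_site = (
    toy_weather
    .drop(columns="site_id")
    .isna()
    .groupby(toy_weather["site_id"])
    .mean()
)

# Averaging across columns gives one rough score per site (lower is better).
missing_score = missing_by_site.mean(axis=1).sort_values()
print(missing_by_site)
print(missing_score)
```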

### Data set `weather_train`

In [None]:
weather_train.info()

The `timestamp` column is stored as categorical, so we convert it to datetime.

In [None]:
weather_train['timestamp'] = pd.to_datetime(weather_train['timestamp'])

In [None]:
weather_train.head()

#### Reports by `site_id`

In [None]:
sites = weather_train['site_id'].unique()

In [None]:
weather_train_reports = dexp.get_report_by_site(weather_train, sites)

In [None]:
weather_train.name = 'weather_train'

In [None]:
dexp.export_reports(weather_train, weather_train_reports, '../../data/external/weather_train_profiles/')

### Data set `weather_test` 

In [None]:
weather_test.info()

`timestamp` is categorical here as well, so we convert it to datetime.

In [None]:
weather_test['timestamp'] = pd.to_datetime(weather_test['timestamp'])

#### Reports by `site_id`

In [None]:
weather_test.name = 'weather_test'
weather_test_reports = dexp.get_report_by_site(weather_test, weather_test['site_id'].unique())
dexp.export_reports(weather_test, weather_test_reports, '../../data/external/weather_test_profiles/')

### Data set `building_metadata` 

In [None]:
building_meta.head()

In [None]:
building_meta.info()

The features `year_built` and `floor_count` are of `float` type. We cast them to nullable integers.

In [None]:
building_meta['year_built'] = pd.array(building_meta['year_built'], dtype=pd.Int32Dtype())
building_meta['floor_count'] = pd.array(building_meta['floor_count'], dtype=pd.Int32Dtype())
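
The nullable `Int32` dtype matters here because a plain NumPy integer column cannot hold missing values. A minimal sketch on invented data, assuming the column contains gaps:

```python
import numpy as np
import pandas as pd

# Toy column with a gap (values invented).
toy_meta = pd.DataFrame({"year_built": [1990.0, np.nan, 2005.0]})

# A plain NumPy integer dtype cannot represent NaN, so this cast fails.
try:
    toy_meta["year_built"].astype("int32")
    plain_cast_ok = True
except (ValueError, TypeError):
    plain_cast_ok = False

# The nullable extension dtype keeps the gap as pd.NA.
toy_meta["year_built"] = toy_meta["year_built"].astype("Int32")
print(plain_cast_ok)
print(toy_meta["year_built"])
```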

#### Reports by `site_id`

In [None]:
building_meta.name = 'building_metadata'
building_meta_reports = dexp.get_report_by_site(building_meta, building_meta['site_id'].unique())
dexp.export_reports(building_meta, building_meta_reports, '../../data/external/building_meta_profiles/')

### Data set `train`

In [None]:
train.head()

In [None]:
train.isna().sum()

In [None]:
train.info()

## Data selection

After reviewing the data set profiles, we select the location with `site_id` 1. The following cells select the rows of each data set for this location and export the resulting data frames to `csv` files, which serve as the starting data sets for this work.

In [None]:
weather_train[weather_train['site_id']==1].to_csv('../../data/interim/site_1/weather_train.csv', index=False)
weather_test[weather_test['site_id']==1].to_csv('../../data/interim/site_1/weather_test.csv', index=False)
building_meta[building_meta['site_id']==1].to_csv('../../data/interim/site_1/building_metadata.csv', index=False)

The `train` set has no `site_id` column; its foreign key is `building_id`. We therefore extract all rows whose `building_id` appears in the list of building ids for site 1.

In [None]:
buildings_site_1 = list(building_meta.loc[building_meta['site_id']==1, 'building_id'])

In [None]:
# Keep every row whose building belongs to site 1.
# isin replaces the per-building DataFrame.append loop (append was removed in pandas 2.0);
# rows keep their original order instead of being grouped by building.
train_site_1 = train[train['building_id'].isin(buildings_site_1)]

In [None]:
train_site_1.to_csv('../../data/interim/site_1/train.csv', index=False)
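
To make the selection logic concrete, the sketch below compares a single `isin` mask with a per-building selection on invented toy data. Both return the same rows; only the row order differs. The ids in `site_buildings` are hypothetical.

```python
import pandas as pd

# Toy readings (values invented).
toy_train = pd.DataFrame({
    "building_id": [105, 7, 106, 105, 8],
    "meter_reading": [1.0, 2.0, 3.0, 4.0, 5.0],
})
site_buildings = [105, 106]  # hypothetical ids for the chosen site

# Vectorised selection: one boolean mask, original row order preserved.
by_mask = toy_train[toy_train["building_id"].isin(site_buildings)]

# Per-building selection, concatenated (rows end up grouped by building).
by_loop = pd.concat(
    [toy_train[toy_train["building_id"] == b] for b in site_buildings]
)

# Same rows once both results are put in the same order.
same_rows = by_mask.sort_index().equals(by_loop.sort_index())
print(same_rows)
```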

In [None]:
del(weather_train, building_meta, train)

Same for `test` set:

In [None]:
# Same selection for the test set (isin replaces the removed DataFrame.append loop).
test_site_1 = test[test['building_id'].isin(buildings_site_1)]

In [None]:
test_site_1.to_csv('../../data/interim/site_1/test.csv', index=False)